A Pipeline for Multimedia Twitter Analysis through Graph Databases: Preliminary Results
Roberto Boselli¹,³, Mirko Cesarini¹,³, Fabio Mercorio¹,³, Mario Mezzanzanica¹,³ and Alessandro Vaccarino²
¹ Dep. of Statistics and Quantitative Methods, University of Milan-Bicocca, 8 via Bicocca degli Arcimboldi, Milan, Italy
² Aubay Italia SpA, 2 Largo la Foppa, Milan, Italy
³ CRISP Research Centre - University of Milan-Bicocca, Milan, Italy
Keywords:
Graph-Database, Social Network Analysis, Microblogging.
Abstract:
Twitter is a microblogging service where users post not only short messages, but also images and other
multimedia content. Twitter can be used for analysing people's public discussions, as a huge number of
messages are continuously broadcast by users. Analyses have usually focused on the textual part of
messages, but the non-negligible number of images exchanged calls for specific attention. In this paper
we describe how the multimedia content of tweets can be turned into a knowledge graph and then used for
analysing the messages sent during marketing campaigns. The information extraction and processing
pipeline is built on top of off-the-shelf APIs and products, while the obtained knowledge is modelled
through a graph database. The resulting knowledge graph was useful to explore and identify similarities
among different marketing campaigns carried out using Twitter, providing some preliminary but promising
results.
1 INTRODUCTION AND MOTIVATION
On the Twitter microblogging service, users post millions of short messages daily, often including
pictures and videos. The number of tweets containing multimedia content is growing rapidly, mainly due
to the spread of services that allow users to share content on Twitter directly from other social media
platforms, e.g., Facebook, LinkedIn, and Instagram.
In such a context, we believe that the images attached to a tweet's text can be used effectively to
better investigate the tweet's informative content by extracting and recognising common features within
pictures, such as emotional face attributes, close-up people, image-specific colours, objects, brand
logos, etc. This, in turn, would help to understand how images attached to tweets are used in marketing
campaigns, and to identify relationships among image features across different social campaigns.
Some existing approaches aim at extracting images from Twitter related to real-life events,
e.g., (Kaneko and Yanai, 2016), as well as at visualising them using Twitter geo-tags (Yanai et al.,
2014). On the other hand, several studies apply social network analysis (SNA) to social media data, in
particular Twitter data, to analyse user behaviours and interactions in several contexts, ranging from
marketing to social community management (Java et al., 2007; Ediger et al., 2010; Cheong and Cheong,
2011).
By contrast, our approach aims at building a knowledge graph of the photo features of tweets. We then
query the resulting graph database to identify common patterns within tweets, in order to analyse how
a brand marketing campaign has been conducted.
Indeed, Social Media Marketing (SMM) is the evolution of the traditional marketing process, in which
brands look for both visibility and dialogue with consumers using social media. SMM allows brands to
interact with customers more directly and on a more equal footing, e.g., through a stream of tweets on
Twitter or comments on Facebook. Interactions and comments generate the so-called "engagement", which
allows brands to get feedback, opinions, advice, reviews, etc. (Hoffman and Fodor, 2010). SMM offers
consumers the opportunity to express themselves without intermediaries, while it allows brands to
listen to and meet customers' needs, and to involve them in marketing projects (crowdsourcing). The
messages exchanged via Twitter offer companies new ideas and insights to promote web marketing
campaigns and to improve consumer relationships. Each activity on social media must be measured with
the right social media analytics metrics according to specific analytical objectives, e.g., sentiment
analysis or engagement rate calculation (Peters et al., 2013). Brands are constantly looking for new
ways to extract information from Twitter messages and social media interactions in general.
Boselli, R., Cesarini, M., Mercorio, F., Mezzanzanica, M. and Vaccarino, A.
A Pipeline for Multimedia Twitter Analysis through Graph Databases: Preliminary Results.
DOI: 10.5220/0006490703430349
In Proceedings of the 6th International Conference on Data Science, Technology and Applications (DATA 2017), pages 343-349
ISBN: 978-989-758-255-4
Copyright © 2017 by SCITEPRESS Science and Technology Publications, Lda. All rights reserved
From a practical perspective, graph-based applications have already proved their effectiveness in
real-life settings, such as social network analysis (Lin et al., 2012; Dries et al., 2009; Zou et al.,
2009; Amato et al., 2017; Appel and Moyano, 2017), data cleaning (Boselli et al., 2013; Mezzanzanica
et al., 2013; Mezzanzanica et al., 2015a), biology (Eckman and Brown, 2006; Benhiba et al., 2017), Web
mining (Schenker et al., 2005), graph exploration (Della Penna et al., 2010; Mercorio, 2013), the
semantic Web (Hayes and Gutierrez, 2004; Zeng et al., 2013), and healthcare (Boselli et al., 2014),
just to cite a few.
Paper's Goal. In this paper we present a system that automatically retrieves tweets and builds a
knowledge graph of the attached multimedia content for analysis purposes. Concisely, the system
collects the tweets and attached images sent from a given list of user accounts, then employs
off-the-shelf machine-learning and vision APIs to identify the main elements and features of the
images. Finally, it builds a knowledge graph on which SNA metrics and graph-traversal queries can be
computed. Our framework has been designed, implemented, and then tested on the use-case scenario of a
marketing campaign performed by four well-known fast-food restaurants. We also provide some
preliminary but promising results.
2 Graph-DB AT A GLANCE
In recent years the well-known NoSQL movement has brought to the fore new data model paradigms that
differ significantly from the classical relational model (e.g., key-value, document, column-oriented
and graph databases). For further details on NoSQL databases the interested reader can refer
to (Han et al., 2011; Cattell, 2011).
All these new paradigms share some interesting features compared with the classical relational model
(see, e.g., (Stonebraker, 2010)), such as a flexible schema that can evolve to fit the data, the
ability to scale horizontally with ease, and native support for sharding. However, most NoSQL
databases store sets of disconnected documents and values (aka aggregate models), as in the case of
key-value and document databases. This, in turn, makes it difficult to use them for connected data. By
contrast, graph databases (see (Angles and Gutierrez, 2008)) present three distinct characteristics
useful for our purposes: (i) they allow for a natural modelling of the data through nodes and edges
between nodes. Each node might contain attributes in a schema-free fashion, while edges make it
possible to connect entities with labels and properties; (ii) queries can be performed directly on the
graph, thus inheriting a large number of efficient algorithms and well-formalised problems on graphs
(e.g., shortest-path, A*, SNA metrics, etc.); (iii) in recent years, graph databases have been growing
in importance in both industrial and academic communities, making available a number of (stable
enough) solutions for storing and querying graphs in real-life situations (e.g., Neo4j (Webber, 2012),
OrientDB (Tesoriero, 2013), Titan (Titan, 2017), just to cite a few).
Formally speaking, a graph is a pair $G = (V, E)$ where $V$ is a finite set of nodes and $E$ is a set
of unordered pairs $(u, v)$ with $u, v \in V$, the so-called edges. Two nodes $u$ and $v$ are adjacent
if $(u, v) \in E$. Although the graph structure is simple and well-known, a graph store can be
implemented in several ways, as in the case of Neo4j. Neo4j is a property graph database, which means
it is an attributed, labelled, directed multi-graph¹. Neo4j is composed of four building blocks:
- Labels associate a common name with a set of nodes to allow for fast indexing. From a conceptual
  perspective, labels can be seen as a construct to model E/R hierarchies, as a node can have more
  than one label at a time (a relationship, by contrast, has exactly one type).
- Nodes play the role of tuples in a relational database; each node can hold a different number of
  attributes of different data types.
- Relationships can exist between nodes. In a property graph model we can represent directed binary
  relations, which always have a start node and an end node.
- Properties are key-value pairs that can be attached to any node or relationship.
¹ A multi-graph is a graph where multiple edges between two nodes are permitted and might be
specified through labels.
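To make the four building blocks more concrete, the following minimal Python sketch models a property graph in memory. It is only an illustration of the data model, not Neo4j's API: the class names `Node`, `Relationship` and `PropertyGraph`, the sample property values, and the direction chosen for the `Attach` edge are all assumptions introduced here.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Set


@dataclass
class Node:
    node_id: int
    labels: Set[str] = field(default_factory=set)              # a node may carry several labels
    properties: Dict[str, Any] = field(default_factory=dict)   # schema-free key-value pairs


@dataclass
class Relationship:
    start: int                                                  # id of the start node
    end: int                                                    # id of the end node
    rel_type: str                                               # a single type, e.g. "Attach"
    properties: Dict[str, Any] = field(default_factory=dict)


class PropertyGraph:
    """Attributed, labelled, directed multi-graph (illustrative only)."""

    def __init__(self) -> None:
        self.nodes: Dict[int, Node] = {}
        self.relationships: List[Relationship] = []             # a list permits parallel edges

    def add_node(self, labels: Set[str], **properties: Any) -> Node:
        node = Node(len(self.nodes), set(labels), dict(properties))
        self.nodes[node.node_id] = node
        return node

    def relate(self, start: Node, rel_type: str, end: Node, **properties: Any) -> Relationship:
        rel = Relationship(start.node_id, end.node_id, rel_type, dict(properties))
        self.relationships.append(rel)
        return rel

    def by_label(self, label: str) -> List[Node]:
        return [n for n in self.nodes.values() if label in n.labels]


# Hypothetical sample data: one tweet with one attached photo.
graph = PropertyGraph()
tweet = graph.add_node({"Tweet"}, text="hypothetical tweet text")
photo = graph.add_node({"Photo"}, url="https://example.org/img.jpg")
graph.relate(tweet, "Attach", photo)
graph.relate(tweet, "Attach", photo, note="parallel edge")     # allowed in a multi-graph
```

Storing relationships in a list rather than a set keyed by endpoints is what makes this a multi-graph: two edges with the same start, end and type can coexist and differ only in their properties.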
KomIS 2017 - Special Session on Knowledge Discovery meets Information Systems: Applications of Big Data Analytics and BI -
methodologies, techniques and tools
3 THE PIPELINE ARCHITECTURE
The whole process is designed to remain versatile when analysing different domains and scenarios.
The architecture is scheduled through the Talend ETL (Bowen, 2012) orchestration component, which
makes it possible to design, schedule and monitor each phase of the system. The Talend layer is also
used to handle the JSON interface files exchanged between the different phases of the process.
A Python layer is responsible for interfacing with the Twitter and Google Cloud Vision APIs, and for
interacting with Neo4j.
In Fig. 1 we show both the architecture and the pipeline workflow, the latter mainly composed of four
phases, namely:
Figure 1: System Architecture.
Phase 1. Collecting Tweets. The public Twitter APIs are called from Python to collect tweets, using
as input (1) a specific user account of interest, and (2) a list of keywords. A JSON file is returned
for each tweet retrieved². The dataset includes, among others, some information relevant for our
purposes, namely: (i) tweet data, which include the tweet ID, text, date and time of publication,
geolocation and retweet details; (ii) user data, such as ID, followers count and location; (iii) the
hashtag list; and (iv) the URL list of the attached tweet images (if any).
All this information is stored locally through Talend to be used by the next phase.

Phase 2. Downloading Images. The pipeline scans each URL attached to a tweet and downloads the
corresponding image. Clearly, if several tweets contain the same image file (e.g., the tweet is a
retweet) the image is downloaded only once.

Phase 3. Image Processing. All the images collected at the previous step are stored on a cloud
service so that they can easily be submitted to Google's Cloud Vision platform through its proprietary
REST API, called Google Cloud Vision API. The latter performs an image content analysis to
automatically recognise specific items, such as objects, faces, known logos, text, colours, and the
sentiments related to the photo. According to the official documentation³, the Google service employs
machine learning algorithms for classifying, among others:
- the close-up entity labelled, which is computed from a wide range of object categories (e.g., car,
  dog, face, etc.). A confidence score is also attached to each entity recognised within the image;
- the OCR recognition, which is able to retrieve texts (automatically detecting the language) and
  brand logos;
- the properties detection feature, which allows identifying the set of characteristics of a picture,
  such as its dominant colour in RGB format;
- the facial detection feature, which can detect whether faces appear in images or not, together with
  a set of eight emotional facial attributes of the identified people, like joy, sorrow, and anger.

Phase 4. Building the Knowledge Graph. All the information about tweets (gathered in Phase 1), along
with the image features collected in Phases 2 and 3, is arranged into a graph-database model. For the
sake of completeness we report both the E/R model and the Graph-DB model in Fig. 2 and Fig. 3,
respectively. As one might note, entities of the E/R model correspond to labels in the property graph
model, whilst relationships are modelled as edges.

² The interested reader can refer to the official Twitter API documentation for further details at
https://dev.twitter.com/rest/public/search
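The deduplication in Phase 2 and the mapping in Phase 4 can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the input field names (`id`, `user_id`, `hashtags`, `image_urls`) are hypothetical and do not reproduce the Twitter JSON schema, the `labels_by_url` dictionary stands in for the output of the image-processing phase, and the endpoints chosen for the `Post`, `Has`, `Attach` and `Contains` edges are a guess based on the edge names listed in Table 1.

```python
from typing import Dict, List, Set, Tuple

NodeKey = Tuple[str, object]            # (label, identifier)
Edge = Tuple[NodeKey, str, NodeKey]     # (start node, relationship type, end node)


def unique_image_urls(tweets: List[dict]) -> List[str]:
    """Phase 2: list each distinct image URL once, even if several tweets share it."""
    seen: Set[str] = set()
    ordered: List[str] = []
    for tweet in tweets:
        for url in tweet.get("image_urls", []):
            if url not in seen:
                seen.add(url)
                ordered.append(url)
    return ordered


def build_graph_records(
    tweets: List[dict],
    labels_by_url: Dict[str, List[Tuple[str, float]]],
    min_score: float = 0.0,
) -> Tuple[Set[NodeKey], Set[Edge]]:
    """Phase 4: turn tweets and per-image labels into node and edge records."""
    nodes: Set[NodeKey] = set()
    edges: Set[Edge] = set()
    for tweet in tweets:
        tweet_node = ("Tweet", tweet["id"])
        user_node = ("User", tweet["user_id"])
        nodes.update({tweet_node, user_node})
        edges.add((user_node, "Post", tweet_node))
        for tag in tweet.get("hashtags", []):
            tag_node = ("HashTag", tag.lower())
            nodes.add(tag_node)
            edges.add((tweet_node, "Has", tag_node))
        for url in tweet.get("image_urls", []):
            photo_node = ("Photo", url)
            nodes.add(photo_node)
            edges.add((tweet_node, "Attach", photo_node))
            for name, score in labels_by_url.get(url, []):
                if score >= min_score:          # keep only confident detections
                    label_node = ("Label", name)
                    nodes.add(label_node)
                    edges.add((photo_node, "Contains", label_node))
    return nodes, edges


# Hypothetical input: a tweet and its retweet share the same image.
tweets = [
    {"id": 1, "user_id": "u1", "hashtags": ["Burger"], "image_urls": ["http://example.org/a.jpg"]},
    {"id": 2, "user_id": "u1", "hashtags": ["burger"], "image_urls": ["http://example.org/a.jpg"]},
]
labels_by_url = {"http://example.org/a.jpg": [("food", 0.99), ("table", 0.60)]}

urls = unique_image_urls(tweets)
nodes, edges = build_graph_records(tweets, labels_by_url, min_score=0.97)
```

Using sets of records means that a photo shared by several tweets yields a single `Photo` node with one `Attach` edge per tweet, which mirrors the "download only once" rule of Phase 2.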
4 PRELIMINARY EXPERIMENTAL RESULTS
In this preliminary experimental phase we analysed the images referenced by 1,000 tweets sent from four
distinct burger restaurant accounts, namely McDonald's, Burger King, Taco Bell and KFC. In Tab. 1 we
report the statistics of the resulting knowledge graph as modelled in Neo4j.
The goal of this experimental evaluation was to assess the effectiveness of the proposed approach in a
sandbox environment.
³ https://cloud.google.com/vision
Table 1: Graph statistics.
Nodes
  Photo      Tweet          User     HashTag  Color  Label   FaceEmotion
  710        940            841      727      450    587     7
Edges
  :Contains  :Face Emotion  :Attach  :Has     :Post  :Color  :Jaccard
  4,640      1,021          945      2,183    940    2,501   6,010
Figure 2: Data Model of the Knowledge extracted from the tweeted multimedia content.
Figure 3: The System Data Model.
Computing Jaccard Similarities. Analysing the graph database obtained using the pipeline described in
Sec. 3, we computed the Jaccard Index (Real and Vargas, 1996) (as a similarity relationship) on pairs
of Label nodes (Cypher Query 1). It is worth recalling that Label nodes represent elements found
within images by the Google Cloud Vision API. The Jaccard Index between two labels is computed as the
number of images having both labels over the number of images having at least one of them. Let $A$ be
the set of images having element $a$ and let $B$ be the set of images having element $b$; the Jaccard
Index is computed as
$$J_{a,b} = \frac{|A \cap B|}{|A \cup B|} \tag{1}$$
This analysis allows one to identify recurrent elements within the photos published by a vendor and
to derive insights about the marketing plan adopted. Looking at the graph in Fig. 5(a), it is evident
that McDonald's focuses more on outdoor images than its competitors. At the same time, Burger King
focuses more on sports-related images, as outlined in Fig. 5(b). This analysis is useful for
understanding the features that brands use in their marketing campaigns and how these features are
related.
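As a small worked illustration of Eq. (1), the Python sketch below computes the Jaccard Index for every pair of labels from a photo-to-labels mapping. The function names and the three sample photos are hypothetical; in the paper the computation is carried out inside Neo4j with Cypher Query 1.

```python
from itertools import combinations
from typing import Dict, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """|A ∩ B| / |A ∪ B|, defined as 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def label_jaccard(photo_labels: Dict[str, Set[str]]) -> Dict[Tuple[str, str], float]:
    """Jaccard Index for each unordered pair of labels, keyed in alphabetical order."""
    # Invert the mapping: label -> set of photos in which the label was detected.
    photos_by_label: Dict[str, Set[str]] = {}
    for photo, labels in photo_labels.items():
        for label in labels:
            photos_by_label.setdefault(label, set()).add(photo)
    return {
        (a, b): jaccard(photos_by_label[a], photos_by_label[b])
        for a, b in combinations(sorted(photos_by_label), 2)
    }


# Hypothetical detections for three photos.
photo_labels = {
    "p1": {"burger", "outdoor"},
    "p2": {"burger"},
    "p3": {"outdoor", "sport"},
}
scores = label_jaccard(photo_labels)
# "outdoor" appears in {p1, p3} and "sport" in {p3}: 1 shared photo out of 2, i.e. 0.5
```

Here `burger` and `outdoor` share one photo out of three, giving 1/3, while `burger` and `sport` never co-occur and score 0.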
Figure 4: Example of a highly related Clique.
Local Clustering Coefficient. The local clustering coefficient (Watts and Strogatz, 1998) is a measure
of the degree to which nodes in a graph tend to cluster together (aka transitivity coefficient). A
local clustering coefficient was computed on top of the Jaccard Index over hashtags to detect hashtags
that are always used together; their identification is valuable since it allows analysts to discover
interesting relationships among the topics dealt with in tweets. Let $u \in V(G)$ be a node of the
network; the local clustering coefficient for $u$ is an index defined as
$$lc_u = \frac{|\{e_{vw} \in E : e_{uv} \in E,\ e_{uw} \in E\}|}{n_u (n_u - 1)} \tag{2}$$
that is, the ratio between the number of triangles (3-cliques) and the maximum number of triangles in
which the node could be involved, which depends on the number of nodes in the neighbourhood of $u$
(with $u = v_i$), namely
$$n_u = |\{v_j : e_{ij} \in E \lor e_{ji} \in E\}| \tag{3}$$
Figure 5: McDonald's (a) and Burger King's (b) Jaccard Indexes.
The Jaccard Index was also computed among hashtags. Let $H$ be the set of tweets having the hashtag
$h$ and let $K$ be the set of tweets having the hashtag $k$; the Jaccard Index between the hashtags
$h$ and $k$ is
$$J_{h,k} = \frac{|H \cap K|}{|H \cup K|} \tag{4}$$
Fig. 4 shows an example of a clique where the Jaccard Index is equal to 1 among all the involved
hashtags, while Cypher Query 2 shows the code used for computing the local clustering coefficient of
each hashtag.
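A Jaccard Index of 1 in Eq. (4) means that two hashtags occur in exactly the same set of tweets. Under that observation, groups of hashtags that are always used together can also be found without a graph traversal, simply by grouping hashtags by their tweet sets. The Python sketch below illustrates this alternative; it is not the method of Cypher Query 2, and the function name and sample tweets are hypothetical.

```python
from collections import defaultdict
from typing import Dict, FrozenSet, List, Set


def always_together(tweet_hashtags: Dict[int, Set[str]]) -> List[List[str]]:
    """Groups of two or more hashtags whose pairwise Jaccard Index equals 1."""
    # Invert the mapping: hashtag -> set of tweets that contain it.
    tweets_by_tag: Dict[str, Set[int]] = defaultdict(set)
    for tweet_id, tags in tweet_hashtags.items():
        for tag in tags:
            tweets_by_tag[tag].add(tweet_id)
    # Hashtags with identical tweet sets have Jaccard Index 1 with each other.
    groups: Dict[FrozenSet[int], List[str]] = defaultdict(list)
    for tag, tweet_ids in tweets_by_tag.items():
        groups[frozenset(tweet_ids)].append(tag)
    return sorted(sorted(group) for group in groups.values() if len(group) > 1)


# Hypothetical tweets: #a, #b and #c always appear together; #d and #e do not.
tweet_hashtags = {
    1: {"#a", "#b", "#c"},
    2: {"#a", "#b", "#c"},
    3: {"#d"},
    4: {"#d", "#e"},
}
cliques = always_together(tweet_hashtags)
# returns [["#a", "#b", "#c"]]
```

The grouping runs in a single pass over the tweets, whereas the graph-based approach has the advantage of reusing the Jaccard edges already stored in the knowledge graph.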
Finally, we can also analyse the relevance of each feature for each vendor, as we show in Fig. 6. This
gave us a bird's-eye view of the main features that each vendor exploited in its campaign.
The analyses outlined above are an example of the insights that can be obtained by analysing Twitter
content and images using a knowledge graph. Furthermore, the pipeline presented in this paper is
general enough to be used in different scenarios.
Cypher Query 1: Compute Jaccard Similarity between labels
// $X and $Y are query parameters holding the names of the two labels.
// Number of photos carrying label X, i.e. |A| in Eq. (1).
MATCH (p:Photo)-[]-(l:Label)
WHERE l.Label_Name = $X
WITH COUNT(DISTINCT p) AS X_lab_counter
// Number of photos carrying label Y, i.e. |B|.
MATCH (p:Photo)-[]-(l:Label)
WHERE l.Label_Name = $Y
WITH X_lab_counter, COUNT(DISTINCT p) AS Y_lab_counter
// Number of photos carrying both labels, i.e. |A ∩ B|.
MATCH (l1:Label)-[]-(p:Photo)-[]-(l2:Label)
WHERE l1.Label_Name = $X AND l2.Label_Name = $Y
WITH X_lab_counter, Y_lab_counter, COUNT(DISTINCT p) AS Intersect_count
// |A ∪ B| = |A| + |B| - |A ∩ B|
WITH Intersect_count, (X_lab_counter + Y_lab_counter) - Intersect_count AS Union_count
RETURN (Intersect_count * 1.0 / Union_count) AS Jaccard
Cypher Query 2: Local Clustering Coefficient assignment and Clique detection
// For each hashtag a: its neighbours (sn), their number (n) and the
// maximum number of links among them (nk = n * (n - 1) / 2).
MATCH (a:Hashtag)-[:JACCARD_LINK_HASHTAG]-(b:Hashtag)
WITH a, collect(b) AS sn, count(DISTINCT b) AS n,
     (count(DISTINCT b) * (count(DISTINCT b) - 1)) / 2 AS nk
// Count the links that actually exist between pairs of neighbours of a.
MATCH (a)-[:JACCARD_LINK_HASHTAG]-(b1)-[rel:JACCARD_LINK_HASHTAG]-(b2)-[:JACCARD_LINK_HASHTAG]-(a)
WITH a, sn, n, nk, count(DISTINCT rel) AS r,
     toFloat(count(DISTINCT rel)) / toFloat(nk) AS coef
// A coefficient of 1 means a and its neighbours form a clique.
WHERE coef = 1
WITH a, sn
FOREACH (n1 IN sn | SET n1.CliqueId = id(a), a.CliqueId = id(a));

// Second statement: list the detected cliques by size.
MATCH (a:Hashtag)
RETURN DISTINCT a.CliqueId, count(a) AS CliqueDim
ORDER BY CliqueDim DESC
5 CONCLUSIONS AND EXPECTED OUTCOMES
Traditionally, analyses of Twitter messages have focused on textual content, while attached images
have hardly been considered. Nevertheless, the pictures exchanged via Twitter are a rich source of
valuable data on which meaningful analyses can be performed. Nowadays, off-the-shelf image processing
APIs can extract several interesting pieces of information from pictures, e.g., dominant colours and
logos, people's faces, and emotional facial attributes like joy, sorrow, and anger.
The data extracted from images can be coupled with other (more traditional) information extracted
from Twitter messages (e.g., the relationships among users, exchanged messages, and hashtags). A
knowledge graph can be built thereon and then used for subsequent analysis.
Figure 6: Correlation between Photo tag & Vendor (confidence higher than 97%).
The information extraction and analysis pipeline described in this paper was used to investigate
the marketing campaigns performed by four burger restaurants on a small but significant dataset. The
graph database that hosted the knowledge graph has proved to be suitable for exploring and identifying
interesting relationships among the images used for each campaign.
Images are very frequently attached to Twitter messages and they convey additional information that
is not present in the text. The information extraction pipeline and the knowledge graph presented here
are general enough to be used for several Twitter analysis tasks, not limited to the specific case
presented in this paper. We are now working on applying this approach to a larger number of tweets
from different marketing campaigns, implementing the pipeline over a big data architecture to scale
out effectively. Furthermore, we also intend to build a graph-based model for reasoning with labour
market data, both structured (Mezzanzanica et al., 2011; Mezzanzanica et al., 2015b) and unstructured
(see, e.g., (Amato et al., 2015)), that would allow implementing graph-traversal queries and SNA
metrics as well.
ACKNOWLEDGMENT
The authors would like to thank Dr. Carla Marini for
her invaluable assistance and thoughtful discussions
while working on this project.
REFERENCES
Amato, F., Boselli, R., Cesarini, M., Mercorio, F., Mezzan-
zanica, M., Moscato, V., Persia, F., and Picariello, A.
(2015). Challenge: Processing web texts for classify-
ing job offers. In Semantic Computing (ICSC), 2015
IEEE International Conference on, pages 460–463.
Amato, F., Moscato, V., Picariello, A., and Sperlì, G.
(2017). Influence Maximization in Social Media Networks
Using Hypergraphs, pages 207–221. Springer
International Publishing, Cham.
Angles, R. and Gutierrez, C. (2008). Survey of graph
database models. ACM Computing Surveys (CSUR),
40(1):1.
Appel, A. P. and Moyano, L. G. (2017). Link and graph
mining in the big data era. In Handbook of Big Data
Technologies, pages 583–616. Springer.
Benhiba, L., Loutfi, A., and Idrissi, M. A. J. (2017). A clas-
sification of healthcare social network analysis appli-
cations. BIOSTEC 2017, page 147.
Boselli, R., Cesarini, M., Mercorio, F., and Mezzanzan-
ica, M. (2013). Inconsistency knowledge discovery
for longitudinal data management: A model-based ap-
proach. In SouthCHI13 special session on Human-
Computer Interaction & Knowledge Discovery, Lec-
ture Notes in Computer Science, vol. 7947. Springer.
Boselli, R., Cesarini, M., Mercorio, F., and Mezzanzanica,
M. (2014). A Policy-Based Cleansing and Integration
Framework for Labour and Healthcare Data, pages
141–168. Springer Berlin Heidelberg, Berlin, Heidel-
berg.
Bowen, J. (2012). Getting Started with Talend Open Studio
for Data Integration. Packt Publishing Ltd.
Cattell, R. (2011). Scalable sql and nosql data stores. Acm
Sigmod Record, 39(4):12–27.
Cheong, F. and Cheong, C. (2011). Social media data min-
ing: A social network analysis of tweets during the
australian 2010-2011 floods. In Seddon, P. B. and Gre-
gor, S., editors, 15th Pacific Asia Conference on In-
formation Systems (PACIS), pages 1–16. Queensland
University of Technology.
Della Penna, G., Intrigila, B., Magazzeni, D., and Merco-
rio, F. (2010). A PDDL+ benchmark problem: The
batch chemical plant. In Proceedings of the The 20th
International Conference on Automated Planning and
Scheduling (ICAPS 2010), pages 222–225, Toronto,
Canada. AAAI Press.
Dries, A., Nijssen, S., and De Raedt, L. (2009). A query
language for analyzing networks. In Proceedings of
the 18th ACM conference on Information and knowl-
edge management, pages 485–494. ACM.
Eckman, B. A. and Brown, P. G. (2006). Graph data man-
agement for molecular and cell biology. IBM journal
of research and development, 50(6):545–560.
Ediger, D., Jiang, K., Riedy, J., Bader, D. A., Corley, C.,
Farber, R. M., and Reynolds, W. N. (2010). Mas-
sive social network analysis: Mining twitter for social
good. In 39th International Conference on Parallel
Processing, pages 583–593.
Han, J., Haihong, E., Le, G., and Du, J. (2011). Survey on
nosql database. In Pervasive computing and applica-
tions (ICPCA), 2011 6th international conference on,
pages 363–366. IEEE.
Hayes, J. and Gutierrez, C. (2004). Bipartite graphs as in-
termediate model for rdf. In International Semantic
Web Conference, pages 47–61. Springer.
Hoffman, D. L. and Fodor, M. (2010). Can you measure
the roi of your social media marketing? MIT Sloan
Management Review, 52(1):41.
Java, A., Song, X., Finin, T., and Tseng, B. (2007). Why
we twitter: Understanding microblogging usage and
communities. In Proceedings of the 9th WebKDD and
1st SNA-KDD 2007 Workshop on Web Mining and
Social Network Analysis, WebKDD/SNA-KDD ’07,
pages 56–65, New York, NY, USA. ACM.
Kaneko, T. and Yanai, K. (2016). Event photo mining
from twitter using keyword bursts and image cluster-
ing. Neurocomputing, 172:143–158.
Lin, C.-Y., Wu, L., Wen, Z., Tong, H., Griffiths-Fisher,
V., Shi, L., and Lubensky, D. (2012). Social net-
work analysis in enterprise. Proceedings of the IEEE,
100(9):2759–2776.
Mercorio, F. (2013). Model checking for universal planning
in deterministic and non-deterministic domains. AI
Communications, 26(2):257–259.
Mezzanzanica, M., Boselli, R., Cesarini, M., and Mercorio,
F. (2011). Data quality through model checking
techniques. In Gama, J., Bradley, E., and Hollmén,
J., editors, Intelligent Data Analysis (IDA), Lecture
Notes in Computer Science vol. 7014, pages 270–281.
Springer.
Mezzanzanica, M., Boselli, R., Cesarini, M., and Mercorio,
F. (2013). Automatic synthesis of data cleansing ac-
tivities. In Helfert, M., Francalanci, C., and Filipe, J.,
editors, DATA 2013 - the International Conference on
Data Technologies and Applications, pages 138–149.
SciTePress.
Mezzanzanica, M., Boselli, R., Cesarini, M., and Mercorio,
F. (2015a). A model-based approach for developing
data cleansing solutions. Journal of Data and Infor-
mation Quality (JDIQ), 5(4):13.
Mezzanzanica, M., Boselli, R., Cesarini, M., and Mercorio,
F. (2015b). A model-based evaluation of data quality
activities in KDD. Information Processing & Man-
agement, 51(2):144–166.
Peters, K., Chen, Y., Kaplan, A. M., Ognibeni, B., and
Pauwels, K. (2013). Social media metrics. a frame-
work and guidelines for managing social media. Jour-
nal of interactive marketing, 27(4):281–298.
Real, R. and Vargas, J. M. (1996). The probabilistic basis
of jaccard’s index of similarity. Systematic biology,
45(3):380–385.
Schenker, A., Kandel, A., Bunke, H., and Last, M. (2005).
Graph-theoretic techniques for web content mining,
volume 62. World Scientific.
Stonebraker, M. (2010). Sql databases v. nosql databases.
Communications of the ACM, 53(4):10–11.
Tesoriero, C. (2013). Getting Started with OrientDB. Packt
Publishing Ltd.
Titan (2017). Distributed graph database.
http://titan.thinkaurelius.com/.
Watts, D. J. and Strogatz, S. H. (1998). Collective dynamics
of 'small-world' networks. Nature, 393(6684):440–442.
Webber, J. (2012). A programmatic introduction to neo4j.
In Proceedings of the 3rd annual conference on Sys-
tems, programming, and applications: software for
humanity, pages 217–218. ACM.
Yanai, K., Kaneko, T., and Kawano, Y. (2014). Real-time
photo mining from the twitter stream: event photo
discovery and food photo detection. In Multimedia
(ISM), 2014 IEEE International Symposium on, pages
295–302. IEEE.
Zeng, K., Yang, J., Wang, H., Shao, B., and Wang, Z.
(2013). A distributed graph engine for web scale rdf
data. In Proceedings of the VLDB Endowment, vol-
ume 6, pages 265–276. VLDB Endowment.
Zou, L., Chen, L., and Özsu, M. T. (2009). Distance-join:
Pattern match query in a large graph database.
Proceedings of the VLDB Endowment, 2(1):886–897.