i-SLOD: Towards an Infrastructure for Enabling the Dissemination

and Analysis of Sentiment Data

Rafael Berlanga, Dolores Mª Llidó, Lisette García, Victoria Nebot,

María José Aramburu and Ismael Sanz

Universitat Jaume I, Avda. Vicent Sos Baynat s/n, Castellón de la Plana, Spain

Keywords: Opinion Analysis, Open Linked Data, Business Intelligence.

Abstract: This paper proposes a new data infrastructure for massive opinion analysis, called i-SLOD, from a Business

Intelligence (BI) perspective. This infrastructure aims to allow analysts to re-use the existing review data

about products and services publicly available on the Web. It should also take advantage of the external

relationships of i-SLOD data in order to perform new exploratory analyses now unfeasible with traditional

BI tools. We consider the adoption of Linked Open Data (LOD) technology to build this infrastructure. In

this way, i-SLOD data will be published as distributed linked open data by using the RDF and OWL

formats. Moreover, we propose to apply automatic semantic annotation to perform the basic tasks in

i-SLOD, mainly the extraction of opinion facts from raw text, and linking opinion data to the i-SLOD and

other related LOD datasets.

1 INTRODUCTION

The massive publication of opinions about products

and services has produced a burst of methods for

sentiment analysis (Liu, 2012). Most of these

approaches directly deal with the review texts to

identify global assessments (reputation) of certain

products and services. They are mainly focused on

detecting the subject of the opinion (e.g., some

product or some aspect of it) as well as the

orientation of the opinion (i.e., polarity). Massive

mining of opinions allows obtaining good indicators

about the Voice of the Market (García-Moya et al.,

2013a). Due to the high interest of this kind of data,

a good number of commercial tools have recently

appeared in the market, for example Swotti, Radian6

Insight, Media Miser, Scout Labs, Wise Window

and Sinthesio, to mention a few. Unfortunately, most

of these tools just provide web reports targeted to

end-users, and the sentiment data is not publicly

available for third party applications.

Apart from the sentiment analysis approaches,

there is also a great interest in publishing strategic

data for Business Intelligence (BI) tasks within the

Linked Open Data (LOD) cloud (Heath and Bizer,

2011). Initiatives like Schema.org are allowing the

massive publication of product offers as microdata,

as well as specific vocabularies for e-commerce

applications. Unfortunately, both worlds, sentiment

data and LOD technology, have remained unconnected from

each other until recently. Some preliminary projects

such as MARL (Westerski and Iglesias, 2011)

attempt to provide standardized schemas for

expressing opinion data as linked data. However,

nowadays there is no open data infrastructure that

allows users and applications to directly perform

analysis tasks over huge amounts of published

opinions on the Web.

In this paper, we propose i-SLOD, a new data

infrastructure for sentiment data aimed at satisfying

the necessity of generating and analysing opinion

data from a BI perspective in the context of the LOD

initiative.

2 i-SLOD ROAD MAP

Traditional BI assumes the existence of a controlled

set of data sources, from which summarized data is

obtained for decision making tasks. BI architectures

usually rely on a data warehouse defined under a

multidimensional model (i.e., just consisting of

measures and dimensions), which is fed with data

extracted from existing data sources by applying the

so-called Extraction, Transform and Load (ETL)

processes. Finally, data is summarized by applying

efficient BI tools such as OLAP.

Berlanga R., Mª Llidó D., García L., Nebot V., José Aramburu M. and Sanz I.

i-SLOD: Towards an Infrastructure for Enabling the Dissemination and Analysis of Sentiment Data.

DOI: 10.5220/0004627902140219

In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval and the International Conference on Knowledge Management and Information Sharing (KDIR-2013), pages 214-219

ISBN: 978-989-8565-75-4

© 2013 SCITEPRESS (Science and Technology Publications, Lda.)

From a BI point of view, opinion data can also be

multidimensionally modelled and analysed. For

example, the reputation of a product, the most

outstanding features of some product brand, or the

opined aspects can be efficiently computed with

OLAP-like operations (García-Moya et al., 2013a).

The main BI e-commerce patterns we consider in

this project are summarized in Figure 1. Facts such

as sales, offers and opinions account for spatio-

temporal observations of some measure (e.g., units

sold, units offered, number of positive reviews, and

so on), whereas dimensions (labelled with ‘D’)

account for the contexts of such observations.

Dimensions can provide different detail levels

(labelled with ‘L’). In this paper we will mainly

focus on the specification and generation of both

review and opinion facts. Notice that every review

produces two kinds of sentiment facts: the global

review assessment about the item (review fact), and

the specific criticisms of the item features/aspects

(opinion facts).

Figure 1: Main BI patterns over e-commerce facts.
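
As an illustration of the OLAP-like operations mentioned above, the following Python sketch rolls up a rating measure along different dimension levels. The fact layout, the item and brand names, and the `roll_up` helper are all hypothetical; the sketch only assumes that each review fact carries one numeric measure and a few dimension values.

```python
from collections import defaultdict

# Hypothetical review facts: (item, brand, month, rating on a 1-5 scale).
REVIEW_FACTS = [
    ("camera_x", "BrandA", "2013-01", 4),
    ("camera_x", "BrandA", "2013-02", 2),
    ("camera_y", "BrandA", "2013-01", 5),
    ("phone_z", "BrandB", "2013-01", 3),
]


def roll_up(facts, level):
    """Average the rating measure grouped by one dimension level."""
    position = {"item": 0, "brand": 1, "month": 2}[level]
    sums = defaultdict(lambda: [0, 0])  # key -> [rating total, fact count]
    for fact in facts:
        cell = sums[fact[position]]
        cell[0] += fact[3]
        cell[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}


# Reputation at two different detail levels of the same facts.
reputation_by_item = roll_up(REVIEW_FACTS, "item")
reputation_by_brand = roll_up(REVIEW_FACTS, "brand")
```

Changing the `level` argument corresponds to moving between the detail levels of a dimension while keeping the same measure.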

In order to cover these patterns, the main

components of i-SLOD data infrastructure are linked

to each other as well as to other external related

LOD datasets. Figure 2 shows the proposed

architecture, where i-SLOD components are placed

within the inner ring. The outer ring contains other

LOD datasets and vocabularies (dotted boxes) that

can be linked to the proposed infrastructure in order

to enrich or perform exploratory BI.

Every i-SLOD component consists of a series of

RDF-triples datasets regarding some of the

perspectives we consider relevant for BI over

sentiment data. As proposed in LOD, links between

datasets are expressed with “owl:sameAs” statements.

Links to external datasets like DBpedia play a very

relevant role in this infrastructure since they can

enormously facilitate the migration of existing

review and opinion data. For example, reviews

already containing microdata referring to some

product in DBpedia will be automatically assigned

to the product URI of the corresponding i-SLOD

product dataset.

Figure 2: Main components of i-SLOD, and their relation

to existing LOD vocabulary and data sets.
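
The automatic assignment of reviews through external links can be sketched as follows. This is a minimal Python illustration, assuming that triples are plain tuples and that the review microdata has already been parsed into a dictionary; the URIs, the `itemreviewed` key and both helper functions are hypothetical.

```python
OWL_SAME_AS = "owl:sameAs"

# Hypothetical i-SLOD product triples linking to an external dataset.
PRODUCT_TRIPLES = [
    ("slod:item42", OWL_SAME_AS, "dbpedia:Canon_EOS_5D"),
    ("slod:item43", OWL_SAME_AS, "dbpedia:Nikon_D90"),
]


def build_external_index(triples):
    """Map every external URI to the i-SLOD URI declared equivalent to it."""
    return {obj: subj for subj, pred, obj in triples if pred == OWL_SAME_AS}


def assign_item(review_microdata, external_index):
    """Return the i-SLOD item URI for a review, or None if it is unlinked."""
    return external_index.get(review_microdata.get("itemreviewed"))


index = build_external_index(PRODUCT_TRIPLES)
review = {"itemreviewed": "dbpedia:Canon_EOS_5D", "rating": 4}
item_uri = assign_item(review, index)
```

Reviews whose microdata points to an unknown external URI yield `None` and would need a separate linking step.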

Regarding the nature of the data to be published in

this infrastructure, we have identified some basic

requirements in order to make published data useful

in a real BI scenario:

• Support classification of sentiment data through taxonomical relationships.

• Support massive generation of opinion data from review texts.

• Support high distribution of data, providing optimal partitions w.r.t. data usage.

• Provide fresh data by migrating published reviews as quickly as possible.

• Adapt existing e-commerce vocabularies as much as possible in order to facilitate the load of data from different sources.

• Ensure quality and homogeneity of the i-SLOD datasets, dealing with the multi-lingual issues of this BI scenario.

3 i-SLOD DATASETS

In this section, we briefly describe the main datasets

that will constitute the i-SLOD data infrastructure

i-SLOD:TowardsanInfrastructureforEnablingtheDisseminationandAnalysisofSentimentData

215

(inner ring of Figure 2). The main criteria we have

followed to define these datasets are the following:

• Take advantage of existing vocabularies and schemas as much as possible.

• Distribute linked data according to both the identified BI demands and the fact extraction from raw texts.

• Keep the inner datasets coherent.

The rest of the section shows the most relevant

aspects of the datasets included in each component.

3.1 Items Component

This component contains the datasets describing

concrete products and services as well as their

manufacturers (e.g., product brand). These datasets

must be kept as simple as possible just providing the

attributes useful for BI tasks. Other attributes and

relationships can be accessed through the links to

external datasets such as eCl@ss, DBpedia,

ProductDB, FreeBase, etc. For the sake of

simplicity, this component just regards two root

classes: Item and Manufacturer. The schema for the

former is shown in Table 1.

Table 1: Item i-SLOD schema.

Property             Description
s:itemID             Unique identifier of the item.
gr:hasManufacturer   URI of the manufacturer.
rdf:label            Item name.
slod:onDomain        Item family.
rdf:type             Type of item (product, service).

For this component we adopt the vocabularies of

Schema.org (s) and GoodRelations (gr). It is worth

mentioning that, although there are several datasets

about products in the LOD cloud, they do not cover

all products and services. In order to perform BI

tasks, this is a serious limitation since an analytical

query requires all data be expressed under the same

schema. This limitation is present in the MARL

approach (Westerski and Iglesias, 2011), as opinion

products are arbitrarily linked to either external

datasets or literals. In our case, we propose a

homogeneous schema, which can be further linked

to external datasets.
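
To make the requirement of a homogeneous schema concrete, the following Python sketch builds the triples of one item and rejects descriptions that lack any property of Table 1. The `item_triples` helper and all example values are hypothetical; only the property names come from Table 1.

```python
# Property names taken from Table 1.
ITEM_PROPERTIES = ("s:itemID", "gr:hasManufacturer", "rdf:label",
                   "slod:onDomain", "rdf:type")


def item_triples(uri, description):
    """Return one triple per Table 1 property, rejecting incomplete items."""
    missing = [p for p in ITEM_PROPERTIES if p not in description]
    if missing:
        raise ValueError(f"missing properties: {missing}")
    return [(uri, prop, description[prop]) for prop in ITEM_PROPERTIES]


# Hypothetical item description.
camera = {
    "s:itemID": "item42",
    "gr:hasManufacturer": "slod:manufacturer7",
    "rdf:label": "Camera X",
    "slod:onDomain": "digital cameras",
    "rdf:type": "product",
}
camera_triples = item_triples("slod:item42", camera)
```

Because every item exposes the same properties, an analytical query can group or filter on any of them without special cases.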

3.2 Facets Component

This component comprises all the elements subject

to evaluation in the opinions. In this work, the

concept feature is used for denoting concrete

physical parts of an item (e.g., zoom, room, etc.),

whereas the concept aspect is used for abstract

concepts (e.g., design, price, etc.).

Table 2: Facets i-SLOD schema.

Property        Description
slod:facetID    Unique identifier of a facet.
rdf:label       Facet labels.
slod:onDomain   Item family for which it is defined.
rdf:type        Facet type (feature, aspect, etc.).

There are few LOD datasets including facets subject

to opinions. We can find technical specifications

about products, as in eCl@ss, but they do not adequately

cover the features customers usually comment on

(García-Moya et al., 2013b). As a consequence, sentiment

analysis approaches aim at extracting these features

directly from text reviews by applying machine

learning techniques (Liu, 2012).

Indeed, one of the i-SLOD goals is to

conceptualize and make public facets that can be

automatically extracted from reviews. For this

purpose, we propose a simple schema (see Table 2)

to which item facets must map. The main issues

for performing these mappings are: to group together

expressions denoting the same facet, to distinguish

between features and aspects, and to classify

features w.r.t. aspects. Our starting point for

addressing these issues is the statistical approach

presented in (García-Moya et al., 2013b).

3.3 Indicators Component

Sentiment analysis relies on the existence of a set of

words and expressions that indicate some opinion

about a subject. The Indicators component is mainly

based on linguistic resources that allow identifying

facets from review texts as well as sentiments

associated with them.

3.3.1 Opinion Words

Opinion words, also known as sentiment words, are

the most important indicators of sentiments about a

subject. These are words commonly used to express

positive or negative opinions. For example, excellent,

amazing and good are positive words, whereas bad,

terrible and awful are negative ones. Additionally, there

are also multi-word expressions that convey opinions, for

example, cost a pretty penny, cost an arm and a leg

or cost the earth, all of which refer to the

indicator concept expensive.

Opinion indicators can be either context-independent

or context-dependent (Lu et al., 2011).

An opinion indicator is context-dependent when its

polarity depends on the domain and/or the features it

is modifying (e.g., unexpected for movies (+) and

electronic devices (−)). Even within the same

domain, the polarity of an indicator may be different

KDIR2013-InternationalConferenceonKnowledgeDiscoveryandInformationRetrieval

216

depending on the feature. For example, the word

long in digital cameras: “long delay between shots”

(−) and “long battery life” (+). Another interesting

kind of opinion indicators consists of expressions

that implicitly bring the feature. For example, the

indicator “too expensive” refers to the aspect

“price”.
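
The context dependence described above can be sketched as a lookup that prefers the most specific context available. The Python fragment below is only an illustration: the lexicon entries, the numeric encoding of polarity as +1/−1, and the fallback order are assumptions, not part of the i-SLOD schema.

```python
# Hypothetical lexicon: (indicator, domain, facet) -> polarity.
# None acts as a wildcard for "any domain" or "any facet".
POLARITY = {
    ("excellent", None, None): +1,
    ("unexpected", "movies", None): +1,
    ("unexpected", "electronics", None): -1,
    ("long", "cameras", "delay between shots"): -1,
    ("long", "cameras", "battery life"): +1,
}


def polarity(indicator, domain=None, facet=None):
    """Return the most specific polarity known, or None if undetermined."""
    for key in ((indicator, domain, facet),
                (indicator, domain, None),
                (indicator, None, None)):
        if key in POLARITY:
            return POLARITY[key]
    return None
```

For instance, `polarity("long", "cameras", "battery life")` is positive, whereas `polarity("long", "cameras")` stays undetermined because the word has no facet-independent polarity in this lexicon.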

For the Indicators component we propose two

classes: slod:Indicator and slod:Polarity. Table 3 shows

the main properties for the indicator class according

to the previous comments.

Table 3: Properties for opinion indicators.

Property           Description
slod:indicatorID   Unique identifier of a sentiment.
rdf:label          Sentiment words and sentences.
rdf:type           Type of indicator.
slod:onFacet       Associated facet (implicit/context).
slod:hasPolarity   Polarity associated to the indicator.

Nowadays there exist many sentiment lexicons,

some of them available in LOD. The most popular

ones are SentiWordNet (Esuli and Sebastiani, 2006)

and SenticNet (Cambria et al., 2013), which provide

sentiment-based characterizations for common

words in English. Unfortunately, these lexicons are

of limited use because they are only applicable to

English-written reviews, and they do not take into

account context-based indicators (Lu et al., 2011). It

is worth mentioning that there also exist some web

services like SentiStrength (Thelwall et al., 2010)

that compute polarities from free text. Services of this

kind could be applied to this dataset to infer

the values of slod:hasPolarity.

3.3.2 Opinion Shifters

Opinion indicators may not be sufficient to

determine the true or contextual polarity of the

feature. The valence of a polar term may be

modified by one or more words, called contextual

valence shifters. These shifters can be categorized

into several types, including negations (not,

never, none, etc.), intensifiers (deeply, very, little,

rather, etc.), modal shifters (might, possibly, etc.),

and presuppositions (e.g., lack, neglect, fail, etc.).

There are other kinds of shifters (Polanyi and

Zaenen, 2006), but they are less useful for BI

analysis. Table 4 shows the main properties of the

shifter class.

Table 4: Properties for opinion shifters.

Property         Description
slod:shifterID   Unique identifier for the shifter.
slod:change      Change applied to the indicator.
rdf:label        Expressions associated to the shifter.
rdf:type         Type of shifter.
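
One possible reading of the change that a shifter applies to an indicator is a multiplicative factor on its polarity. The Python sketch below follows that assumption; the factor values, the shifter entries and the `shifted_polarity` helper are hypothetical and are not prescribed by the schema.

```python
# Hypothetical shifter lexicon: label -> (type, multiplicative change).
SHIFTERS = {
    "not": ("negation", -1.0),
    "never": ("negation", -1.0),
    "very": ("intensifier", 2.0),
    "little": ("intensifier", 0.5),
    "might": ("modal", 0.5),
}


def shifted_polarity(base_polarity, shifter_labels):
    """Multiply the indicator polarity by the change of each shifter."""
    value = float(base_polarity)
    for label in shifter_labels:
        _kind, change = SHIFTERS[label]
        value *= change
    return value
```

Under this model a negation flips the sign of the polarity, while intensifiers and modal shifters only rescale its strength.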

3.4 Reviews Component

Currently, we can find many proposals for

representing review metadata in LOD. One of the

main references is Schema.org, which has been

adopted by Google for rich snippets over reviews.

This vocabulary covers all aspects we need for the

Reviews component, and therefore we have adopted

it without extensions. Table 5 shows some properties

associated to the review class.

Table 5: Properties for review objects.

Property         Description
s:reviewrating   Overall assessment (s:rating).
s:itemreviewed   Item reviewed.
s:reviewer       Author of the review.
s:dtreviewed     Publication date of the review.

3.5 Opinion Facts Component

Opinion facts express the associations between

features/aspects and opinion indicators that appear in

the review texts.

Table 6: Opinion facts properties.

Property            Description
slod:opinionId      Unique identifier of an opinion fact.
slod:onFacet        Opined facet.
slod:fromReview     Review reference.
slod:onTargetItem   In comparisons, the compared item.
slod:compOperator   In comparisons, the operator being applied (e.g., better, worse, faster, etc.).

In our approach, an opinion fact is always linked to

the review object from which it was identified.

Consequently, each opinion fact takes the time and

place dimensions from its linked review. Thus, the

schema of an opinion fact can be just expressed with

the feature/aspect and indicator/shifters involved in

the fact. Table 6 summarizes the properties

associated to the opinion fact class.

Another kind of opinion fact considered in (Liu,

2012) is that of product comparisons. To represent

comparisons, two properties are added to the opinion

fact class: slod:onTargetItem and slod:compOperator.

Notice that we can combine these properties to

express for example a comparison between two

products w.r.t. some aspect (e.g., “it has better zoom

than camera Y”).
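
A comparison such as the one in this example could be serialized as sketched below in Python. The property names follow Table 6 (including the spelling slod:compOperator); the `opinion_fact` helper, the identifiers and the rule that both comparison properties must appear together are assumptions made for illustration.

```python
def opinion_fact(fact_id, review, facet, target_item=None, operator=None):
    """Return the triples of one opinion fact as (subject, property, object)."""
    if (target_item is None) != (operator is None):
        raise ValueError("a comparison needs both a target item and an operator")
    subject = f"slod:{fact_id}"
    triples = [
        (subject, "slod:fromReview", f"slod:{review}"),
        (subject, "slod:onFacet", f"slod:{facet}"),
    ]
    if target_item is not None:
        triples.append((subject, "slod:onTargetItem", f"slod:{target_item}"))
        triples.append((subject, "slod:compOperator", operator))
    return triples


# "it has better zoom than camera Y", with hypothetical identifiers.
comparison = opinion_fact("oatom7", "review9", "zoom", "cameraY", "better")
```

The reviewed item itself is not repeated in the fact, since it can be reached through the linked review.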

The most similar approach for expressing

opinions in LOD is that of MARL (Westerski and

Iglesias, 2011). The main differences of our

i-SLOD:TowardsanInfrastructureforEnablingtheDisseminationandAnalysisofSentimentData

217

approach w.r.t. MARL are the following. In our

approach, opinion facts must always be linked to

datasets within i-SLOD. In this way, we can ensure

coherence and homogeneity of data for BI analysis.

Moreover, our proposal uncouples the opinion fact

from its polarity, which should be inferred from

indicators and shifters. Finally, we do not allow

opinion aggregations, as they will be performed by

the analytical tools (see Section 4.3).

4 i-SLOD POPULATION

This section discusses how to populate the main

components of i-SLOD data infrastructure.

4.1 ETL Processes

Similarly to traditional data warehouses (DW), we

propose to populate the i-SLOD infrastructure by

means of Extraction, Transform and Load (ETL)

processes. These processes will be in charge of

continuously processing published reviews to update

i-SLOD datasets. In this context, each component

presents a different dynamicity degree. For example,

review and opinion facts will grow very quickly,

whereas products, features and indicators will

change more slowly.

Table 7: Proposed i-SLOD ETLs.

Component            Operators                               Dynamicity
Product/Service      LOD Linking                             Low
Feature/Aspects      Sentiment analysis, LOD Linking         Low
Opinion indicators   Lexica extraction, Sentiment analysis   Low
Review               Microdata extraction, LOD Linking       High
Opinion fact         Semantic Annotation                     High

Unlike traditional ETLing, i-SLOD processes deal

with RDF and web data. Table 7 shows the main

ETL operators involved in the i-SLOD components.

As can be seen, one critical operator consists of

linking all the loaded data to internal and external

datasets (see Figure 2). Another critical operator

consists of applying sentiment analysis to extract

and rank relevant feature/aspects and indicators to be

included in the corresponding datasets.

4.2 Semantic Annotation

We propose to apply automatic semantic annotation

for extracting opinion facts from raw texts, and

linking data. Semantic annotation consists of

identifying concept mentions in free text in

order to link them to existing knowledge resources.

This technique is gaining popularity within the LOD

community as it allows linking unstructured data to

reference knowledge resources (Mendes et al.,

2011). Unfortunately, current tools are all targeted at

Wikipedia.

In our context, semantic annotation should be

performed with any lexicon that can be extracted

from the i-SLOD datasets (rdf:label statements).

In particular, we are interested in identifying

features, indicators and shifters in the review text to

extract opinion facts. An example of opinion fact

extraction is shown in Table 8.

Table 8: Example of opinion facts.

review1: “I don’t like the image and sound of this camera”

(slod:oatom1, slod:fromReview, slod:review1)
(slod:oatom1, slod:onFacet, slod:feature123)       image
(slod:oatom1, slod:withIndicator, slod:indct2)
(slod:oatom1, slod:hasShifter, slod:shifter10)     don’t
(slod:oatom2, slod:fromReview, slod:review1)
(slod:oatom2, slod:onFacet, slod:feature231)       sound
(slod:oatom2, slod:withIndicator, slod:indct2)
(slod:oatom2, slod:hasShifter, slod:shifter10)     don’t

The work in (García-Moya et al., 2013a) will serve

us as a basis to define the tailored semantic annotators

necessary to extract opinion facts.
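
The extraction in Table 8 can be approximated with a very naive lexicon-based annotator, sketched below in Python. It is not the annotator of (García-Moya et al., 2013a): the lexicons are hypothetical stand-ins for rdf:label statements, the mapping of the word “like” to slod:indct2 is an assumption, and the pairing heuristic simply attaches every indicator and shifter in the sentence to every facet mentioned.

```python
import re

# Hypothetical lexicons, as they might be read from rdf:label statements.
FACETS = {"image": "slod:feature123", "sound": "slod:feature231"}
INDICATORS = {"like": "slod:indct2"}  # assumed label for slod:indct2
SHIFTERS = {"don't": "slod:shifter10"}


def annotate(review_id, text):
    """Extract opinion-fact triples from one sentence.

    Naive heuristic: each facet mention becomes one opinion atom, linked to
    all indicators and shifters found in the same sentence.
    """
    tokens = re.findall(r"[a-z']+", text.lower().replace("’", "'"))
    facets = [FACETS[t] for t in tokens if t in FACETS]
    indicators = [INDICATORS[t] for t in tokens if t in INDICATORS]
    shifters = [SHIFTERS[t] for t in tokens if t in SHIFTERS]
    triples = []
    for number, facet in enumerate(facets, start=1):
        atom = f"slod:oatom{number}"
        triples.append((atom, "slod:fromReview", f"slod:{review_id}"))
        triples.append((atom, "slod:onFacet", facet))
        triples += [(atom, "slod:withIndicator", i) for i in indicators]
        triples += [(atom, "slod:hasShifter", s) for s in shifters]
    return triples


facts = annotate("review1", "I don’t like the image and sound of this camera")
```

A realistic annotator would also have to resolve the scope of indicators and shifters, which this sentence-level pairing ignores.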

4.3 BI Analysis in i-SLOD

The i-SLOD infrastructure is meant to hold large

datasets of semi-structured data. Linked data is used

as an integrating tool and provides a new

architectural pattern for mapping and

interconnecting data from a variety of sources. Such

infrastructure should provide the analyst with the

means for executing analytic queries.

BI tools provide a summarized view by

aggregating the data over numerical measures

according to contexts (i.e., dimensions). However,

traditional BI is not suitable for linked data.

Complex queries over the i-SLOD infrastructure

require a data processing model for a cloud

architecture that integrates advanced information

extraction and advanced analysis operations (i.e.,

OLAP operators). For this purpose, the datasets in

the inner ring of i-SLOD can be partitioned and

distributed according to the BI demands. For

example, datasets can be partitioned with respect to

domains and time slices. Moreover, functional

map-reduce implementations (Dean and Ghemawat,

2004) can process such distributed partitions and

KDIR2013-InternationalConferenceonKnowledgeDiscoveryandInformationRetrieval

218

parallelize complex analysis operators such as filter,

join and aggregate (Sridhar et al., 2009).
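
The combination of partitioning and map-reduce processing can be illustrated with a single-process Python sketch. It only mimics the map and reduce steps on in-memory data; the partition keys, the fact layout and the helper functions are hypothetical, and a real deployment would run the map step on distributed partitions.

```python
from collections import Counter
from functools import reduce

# Hypothetical partitions of opinion facts, keyed by (domain, time slice).
# Each fact is (facet, polarity).
PARTITIONS = {
    ("cameras", "2013-01"): [("zoom", 1), ("price", -1), ("zoom", 1)],
    ("cameras", "2013-02"): [("zoom", -1), ("price", -1)],
    ("hotels", "2013-01"): [("room", 1)],
}


def map_partition(facts):
    """Map step: count positive opinions per facet within one partition."""
    return Counter(facet for facet, polarity in facts if polarity > 0)


def reduce_counts(left, right):
    """Reduce step: merge two partial counts."""
    return left + right


def positive_counts(partitions, domain):
    """Filter partitions by domain, map each one, then reduce the partials."""
    partials = [map_partition(facts)
                for (part_domain, _slice), facts in partitions.items()
                if part_domain == domain]
    return reduce(reduce_counts, partials, Counter())


camera_counts = positive_counts(PARTITIONS, "cameras")
```

Since partitions are keyed by domain and time slice, a query restricted to one domain never touches the partitions of the others.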

In order to speed-up costly operations within the

inner i-SLOD datasets, additional indexing

mechanisms can be applied. For example, instead of

performing a join between the opinion atom and the

indicator datasets every time a user asks a query

involving such datasets, we can build an index that

associates each opinion atom with its indicators.

More challenging, however, is to efficiently perform

BI operations involving external datasets, as we do

not have control over the external sources.
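
The index mentioned above could be as simple as a precomputed mapping from opinion atoms to their indicators. The Python sketch below assumes triples stored as tuples and uses the slod:withIndicator property from Table 8; the data and the `build_indicator_index` helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical triples from the opinion fact dataset.
OPINION_TRIPLES = [
    ("slod:oatom1", "slod:onFacet", "slod:feature123"),
    ("slod:oatom1", "slod:withIndicator", "slod:indct2"),
    ("slod:oatom2", "slod:withIndicator", "slod:indct2"),
    ("slod:oatom2", "slod:withIndicator", "slod:indct7"),
]


def build_indicator_index(triples):
    """Precompute opinion atom -> indicators, so queries avoid a join."""
    index = defaultdict(list)
    for subject, prop, obj in triples:
        if prop == "slod:withIndicator":
            index[subject].append(obj)
    return dict(index)


INDICATOR_INDEX = build_indicator_index(OPINION_TRIPLES)
```

The index trades storage and maintenance cost for query time, so it has to be refreshed whenever new opinion facts are loaded.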

On the other hand, the semantics introduced by

the linked data flavour of i-SLOD also require

new scalable, distributed reasoning techniques able

to efficiently compute new inferences so that they

can be used in the analysis process.

5 CONCLUSIONS

We have presented i-SLOD, a proposal for a data

infrastructure of open linked sentiment data. Its

purpose is to facilitate the massive analysis of

sentiment data by exploiting the ever-increasing

amount of publicly available open linked data.

The i-SLOD components are designed to

describe all necessary information for opinion

analysis (products/services, features/aspects, and

opinion indicators, reviews and facts), and also to

incorporate the functionality required to perform

massive opinion analysis: the extraction of opinion

facts from text reviews, and the linkage of opinion

data to other datasets, using semantic annotation as a

key enabling technology.

This allows the exploitation of opinion-related

dimensions of analysis that are out of reach for

traditional BI applications, thus allowing the

incorporation of crucial strategic information.

ACKNOWLEDGEMENTS

This work has been partially funded by the

“Ministerio de Economía y Competitividad” with

contract number TIN2011-24147.

REFERENCES

Cambria, E., Song, Y., Wang, H., Howard, N. (2013).

Semantic Multi-Dimensional Scaling for

Open-Domain Sentiment Analysis. IEEE Intelligent Systems,

DOI: 10.1109/MIS.2012.118.

Dean, J., Ghemawat, S. (2004). MapReduce: Simplified

data processing on large clusters. OSDI ’04, pages

137–150.

Esuli, A., Sebastiani, F. (2006). SENTIWORDNET: A

Publicly Available Lexical Resource for Opinion

Mining. In Proc. LREC’06, 417-422.

García-Moya, L., Kudama, S., Aramburu, M.J., Berlanga,

R. (2013a). Storing and analysing voice of the market

data in the corporate data warehouse. Information

Systems Frontiers, 1-19, DOI: 10.1007/s10796-012-9400-y.

García-Moya, L., Anaya-Sánchez, H., Berlanga, R. (2013b).

A Language Model Approach for Retrieving Product

Features and Opinions from Customer Reviews, IEEE

Intelligent Systems, DOI: 10.1109/MIS.2013.37.

Heath, T., Bizer, C. (2011). Linked Data: Evolving the

Web into a Global Data Space. Morgan & Claypool,

San Rafael, CA, 1st Edition.

Liu, B. (2012). Sentiment Analysis and Opinion Mining,

Morgan & Claypool Publishers.

Lu, Y., Castellanos, M., Dayal, U., Zhai, C. X. (2011).

Automatic construction of a context-aware sentiment

lexicon: an optimization approach. WWW 2011, 347-356.

Mendes, P., Jakob, M., García-Silva, A., Bizer, C. (2011).

DBpedia spotlight: shedding light on the web of

documents. In Proc. of I-Semantics '11.

Polanyi, L., Zaenen, A. (2006). Contextual valence shifters.

Computing Attitude and Affect in Text: Theory and

Applications: The Information Retrieval Series

Volume 20, 1-10, doi: 10.1007/1-4020-4102-0_1.

Sridhar, R., Ravindra, P., Anyanwu, K. (2009). RAPID:

Enabling Scalable Ad-Hoc Analytics on the Semantic

Web. In Proc. of ISWC '09, 715-730.

Thelwall, M., Buckley, K., Paltoglou, G., Cai, D., Kappas,

A. (2010). Sentiment strength detection in short

informal text. JASIST, 61(12), 2544–2558.

Westerski, A., Iglesias, C. A. (2011). Exploiting

Structured Linked Data in Enterprise Knowledge

Management Systems. An Idea Management Case

Study. In Proc. EDOCW, 395-403.

i-SLOD:TowardsanInfrastructureforEnablingtheDisseminationandAnalysisofSentimentData

219