Theoretical Challenges in Knowledge Discovery in Big Data

A Logic Reasoning and a Graph Theoretical Point of View

Pavel Surynek

and Petra Surynková

Charles University Prague, Faculty of Mathematics and Physics, Department of Theoretical Computer Science

and Mathematical Logic, Malostranské náměstí 25, 118 00 Praha, Czech Republic

Charles University Prague, Faculty of Mathematics and Physics, Department of Mathematics Education,

Sokolovská 83, 186 75 Praha 8, Czech Republic

Keywords: Big Data, Data Analysis, Logic Reasoning, Graph Theory, Graph Drawing, Propositional Satisfiability.

Abstract: This paper addresses the problem of knowledge discovery in big data from the point of view of theoretical computer science. Contemporary characterizations of big data are often preoccupied with its volume, velocity of change, and variety, which cause technical difficulties in handling the data efficiently, while the theoretical challenges offered by big data are neglected. In contrast to this preoccupation with technical issues, we discuss more theoretical issues centered on the question of what can be understood from big data by imitating human-like reasoning through logical and algorithmic means. The ultimate goal marked out in this paper is to automate a reasoning process that can manipulate and understand data in volumes beyond human abilities, and to investigate whether substantially different patterns appear in big data than in small data.

1 INTRODUCTION

In contrast to the contemporary understanding of big data (Laney, 2012), which focuses on the technological management of difficulties arising from its ever-increasing volume, velocity of change, and growing variety of sources, we would like to discuss the question of what can be learned from big data by computational techniques. That is, we would like to discuss the big data challenge from the point of view of theoretical computer science and artificial intelligence (Russell and Norvig, 2009). To simplify the situation, we set technical issues aside for now. Regarding the technical difficulties known as the 'V's (velocity, volume, variety, value, veracity), we settle for any solution that allows us to access data in a convenient way and do not address this issue any further.

What we consider more exciting about big data than managing its volume, and what is currently addressed insufficiently, is the automated interpretation of data and automated learning from it. This issue has not yet been addressed to any significant extent, and even the terminology for describing the problems we would like to discuss is lacking. Many techniques already exist, but they are scattered across many other areas and are not focused on big data directly. One goal of this paper is to point out techniques that can be employed in big data processing. Another goal is to show problems that arise in big data and that can be studied theoretically. Knowledge discovery and reasoning in big data are the closest names for the problems that we consider interesting from a theoretical point of view and that we would like to discuss. However, these names should be understood as working ones.

We pick several concrete theoretical problems to describe current big data challenges concretely. For each of them, we suggest a solving approach that appears promising and deserves thorough investigation. We present big data problems from the perspective of the theoretical fields of mathematical logic and graph theory. All the concrete problems are put into the context of related work and background; thus this paper may serve as a brief survey as well.

1.1 Big Data (vs. Small Data)

The core inspiration for the discussion is the question of how to imitate, by algorithmic techniques, the human-like reasoning over small data that humans encounter every day. Such automation allows the imitated reasoning to be applied to amounts of data that are beyond human capabilities. The assumption adopted in this study is that humans use logical reasoning.

Surynek P. and Surynková P.
Theoretical Challenges in Knowledge Discovery in Big Data - A Logic Reasoning and a Graph Theoretical Point of View.
DOI: 10.5220/0005092503270332
In Proceedings of the International Conference on Knowledge Engineering and Ontology Development (KEOD-2014), pages 327-332
ISBN: 978-989-758-049-9
© 2014 SCITEPRESS (Science and Technology Publications, Lda.)

Consider, for example, the human ability to drive a car. The driver perceives visual, audio, and tactile data and processes them quickly to make efficient decisions, such as to accelerate. The situation of the human driver can be regarded as a small data world (even though one may object that the amount of processed data is still big). A corresponding big data world may take into consideration all the cars in a city or even a country at once. The outcome of automated reasoning over such a big data situation may be a prediction or a decision that, for example, prevents a traffic jam.

Analogous examples can be found in how humans extract knowledge from textual data, how they combine facts to answer questions, or how they understand social relations to join profitable coalitions. Successful automation in such cases makes it possible to build knowledge from whole libraries of text or to predict large-scale social trends based on an understanding of relations within large communities.

A very interesting question is whether substantially different patterns appear in big data than in small data; that is, whether the quantity of data leads to a quality that cannot be observed in small data. We consider answering this question the ultimate goal of the effort to understand big data through logical, algorithmic, and graph-theoretical means.

2 CHALLENGES IN BIG DATA

The basic challenge in big data can be characterized as knowledge discovery. Extracting knowledge from big data is a prerequisite for automated reasoning over the data.

One concrete approach to knowledge discovery in data from the point of view of theoretical computer science is to try to find a (logical) theory that represents the data in a compact form. The intuition behind this approach is that a compact representation inherently induces a certain kind of understanding, explanation, or structural insight: without understanding and discovering the intrinsic rules in the data set, the compactness would be impossible (see Figure 1 for an illustration of this intuition).

If data are interpreted as facts or statements, the aim is to find a theory in which these facts or statements are valid (Di Battista et al., 1998). Formally, the set of models of the theory would be equal to the represented data set (Hodges, 1993). One does not need to be that strict about the equality of the two sets; a certain level of approximation of the data set by the theory should also be considered. The availability of such a theory then allows further decision making, such as checking the validity of new propositions, checking the consistency of a set of statements, finding the smallest set of statements that leads to a contradiction with the theory, and many other decisions known from logic reasoning.

Figure 1: Data representation as a set of models of a logical theory T. The set of models M(T) of the theory approximates the input big data. The theory should be small, so that a certain level of understanding or explanation of the data can be obtained from it.
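As a minimal illustration of this idea, the following Python sketch enumerates the models M(T) of a small propositional theory and compares them with a toy data set. The variables, the data, and the theory are hypothetical and chosen only for illustration; exhaustive enumeration is feasible only for a handful of variables.

```python
from itertools import product

# Hypothetical toy data: each fact assigns truth values to (rain, wet, umbrella).
VARIABLES = ("rain", "wet", "umbrella")
DATA = {
    (False, False, False),
    (True, True, True),
    (False, True, False),
    (True, True, False),  # a fact that the theory below does not explain
}


def theory(rain, wet, umbrella):
    """Candidate theory T (an assumption): rain implies wet, umbrella iff rain."""
    return (not rain or wet) and (umbrella == rain)


def models(formula, n_vars):
    """Enumerate M(T): all truth assignments that satisfy the formula."""
    return {v for v in product((False, True), repeat=n_vars) if formula(*v)}


M_T = models(theory, len(VARIABLES))
COVERED = DATA & M_T   # facts explained by T
MISSED = DATA - M_T    # facts that T fails to explain
SPURIOUS = M_T - DATA  # models of T that do not occur in the data
```

In this toy case the theory explains three of the four facts and admits no model outside the data, so M(T) approximates the data set without being equal to it.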

Considering data in large amounts may lead to novel understandings and interpretations. Historically, logical theories were used as formalizations of human reasoning; hence it is quite natural to apply automated logic reasoning to process large amounts of data, which is consistent with the original inspiration. The question of how to find logical representations of data sets algorithmically is discussed in the following sections. Several approaches that should be elaborated further are suggested.

2.1 Compact Data Representation for their Better Understanding

Assume that the input data has the form of a set of logical statements. An important pool of techniques to consider consists of compression techniques for such a set of logical statements. Compression is regarded as a tool for discovering a compact explanation of the given data set.

The first step is to model (logical) data as a set of vectors over a propositional (two-valued) or multiple-valued domain. It is then a natural idea to investigate how such vectors can be represented using an existing concept such as binary decision diagrams (BDDs) (Akers, 1978) or multi-valued decision diagrams (MDDs) (Miller and Drechsler, 1998). Although these concepts are primarily intended as compact representations of the set of models of a certain formula (Rice, 2008), the large body of results on this topic can be utilized in big data research as well.
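To make the idea concrete, the following sketch builds a reduced ordered BDD directly from an explicit set of Boolean vectors. It is a simplified construction under a fixed variable order, not the algorithm of any particular BDD package; the names `build_bdd` and `evaluate` and the even-parity data set are assumptions made for illustration.

```python
from itertools import product


def build_bdd(vectors, n_vars):
    """Build a reduced ordered BDD for a set of length-n 0/1 vectors.

    Node ids are integers: 0 and 1 are the terminals, internal nodes are
    stored in `nodes` as id -> (variable index, low child, high child).
    """
    nodes = {}   # node id -> (var, low, high)
    unique = {}  # (var, low, high) -> node id, so equal subgraphs are shared

    def make(var, low, high):
        if low == high:  # the test on `var` is redundant, skip the node
            return low
        key = (var, low, high)
        if key not in unique:
            unique[key] = len(nodes) + 2
            nodes[unique[key]] = key
        return unique[key]

    def build(var, subset):
        if not subset:
            return 0  # no vector left: the prefix is not in the data
        if var == n_vars:
            return 1  # a complete vector of the data set
        low = build(var + 1, frozenset(v for v in subset if not v[var]))
        high = build(var + 1, frozenset(v for v in subset if v[var]))
        return make(var, low, high)

    return build(0, frozenset(vectors)), nodes


def evaluate(root, nodes, vector):
    """Follow the diagram; True iff `vector` belongs to the represented set."""
    node = root
    while node > 1:
        var, low, high = nodes[node]
        node = high if vector[var] else low
    return node == 1


# Hypothetical data set: all 4-bit vectors with an even number of ones.
DATA = {v for v in product((0, 1), repeat=4) if sum(v) % 2 == 0}
ROOT, NODES = build_bdd(DATA, 4)
```

Here the 8 vectors are captured by 7 internal nodes. For this parity pattern the diagram grows linearly with the number of variables, while the number of vectors doubles with each added variable, which is the kind of compactness the text refers to.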

Techniques for constructing decision diagrams may themselves be enriched by big data research, as we expect big data to offer different challenges. The major difference is that the whole process of data representation is reversed when compared with the representation of models of a formula. Normally, the set of models of a given formula is found and captured explicitly by the decision diagram. In data representation, on the other hand, we start with an explicit data set and, through the intermediate step of a decision diagram, we want to understand the data; that is, to find a formula or a set of formulae (a theory) in which the data are valid.

Decision diagrams are not the only concepts for data representation and compression. Other interesting methods for data compression are matrix factorization (Koren et al., 2009) and matrix sketching (Liberty, 2013). The former has recently been employed successfully in recommender systems (Ricci et al., 2011). These methods compress large sparse matrices by representing them as products of smaller matrices, where a certain tolerance is allowed in the accuracy of the represented matrix. They are particularly attractive for their ability to discover hidden interpretations of data, which has been demonstrated by the discovery of hidden features in recommender systems.
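The factorization idea can be sketched on a tiny example. The code below fits a rank-1 factorization to the observed entries of a small matrix by alternating least squares; this is one simple textbook scheme chosen for illustration, not the method of the cited works, and the matrix, the function names, and the number of sweeps are assumptions.

```python
def als_rank1(ratings, sweeps=20):
    """Fit ratings[i][j] ~ u[i] * v[j] on the observed (non-None) entries."""
    n_rows, n_cols = len(ratings), len(ratings[0])
    u = [1.0] * n_rows
    v = [1.0] * n_cols
    for _ in range(sweeps):
        for i in range(n_rows):  # least-squares optimal u[i] for fixed v
            obs = [j for j in range(n_cols) if ratings[i][j] is not None]
            den = sum(v[j] ** 2 for j in obs)
            if den > 0:
                u[i] = sum(ratings[i][j] * v[j] for j in obs) / den
        for j in range(n_cols):  # least-squares optimal v[j] for fixed u
            obs = [i for i in range(n_rows) if ratings[i][j] is not None]
            den = sum(u[i] ** 2 for i in obs)
            if den > 0:
                v[j] = sum(ratings[i][j] * u[i] for i in obs) / den
    return u, v


def squared_error(ratings, u, v):
    """Sum of squared residuals over the observed entries only."""
    return sum(
        (r - u[i] * v[j]) ** 2
        for i, row in enumerate(ratings)
        for j, r in enumerate(row)
        if r is not None
    )


# Hypothetical 3x3 matrix with one missing entry (None).
RATINGS = [
    [2, 1, 4],
    [4, 2, None],
    [6, 3, 12],
]
U, V = als_rank1(RATINGS)
PREDICTED = U[1] * V[2]  # estimate for the missing entry
```

Nine entries are represented by six numbers, and on this toy matrix the estimate for the missing entry approaches 8, the value consistent with the pattern in the observed entries. The two factors play the role of the hidden features mentioned above.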

Another interesting way to discover knowledge is to extract information from a computational model or classifier known from machine learning (Mitchell, 1997), such as a neural network (Zhang, 2000) or a Bayesian network (Pearl, 1988). This approach has already been used successfully in many variations. A notable example of knowledge extraction from a computational model is the extraction of logic programs from neural networks (Lehmann et al., 2010).

It seems promising to continue research on knowledge extraction from computational models in the context of big data. A suitable computational model can be learned from the input training data and then processed further. The advantage is that many efficient training algorithms for constructing computational models from training data already exist; for neural networks, the back-propagation algorithm (Rumelhart et al., 1986) is one example. However, the large size of the training data must be considered at this stage when dealing with big data. As existing training algorithms are not primarily designed for big data, novel training methods may need to be developed in order to complete the learning stage in acceptable time. In any case, the expected outcome of the process is a computational model that represents the training data in a compact form. Information can then be extracted from the computational model.

The concrete way to extract information is a subject of further research and cannot be answered within this discussion. Nevertheless, it is assumed that the target of information extraction will be a certain logic theory. A no less important advantage of machine learning techniques is that they are typically robust with respect to inconsistencies and inaccuracies in the training data sets. Inconsistency is an important issue in big data collected from real-life sources (Huang et al., 2013). Thus, the burden of dealing with data inconsistencies can be partly passed on to the learning process of the given computational model.

2.2 Deciding Big Data Problems in Description Logic through SAT Solving

A well-developed framework that provides rich description concepts and a variety of decision methods is description logic (DL) (Knorr et al., 2011). Currently, description logic is often used as a knowledge representation tool in the semantic web and bioinformatics, as it excels in expressing statements about individuals from some domain (such as medicine). Decision problems in description logic include testing whether a certain individual belongs to a given category or whether given individuals are bound together by a relation. Generally, decision problems in description logic can be regarded as a more advanced and more complex variant of database querying (Bienvenu et al., 2013). Again, it is very interesting to use DL for the representation of big data sets and to apply DL reasoning and decision procedures to derive meaningful facts from the data set.

DL itself is an extremely broad topic; thus a realistic attitude towards DL from the perspective of big data is to pick decision procedures suitable for big data reasoning and eventually to adapt and improve these procedures. The problematic point of applying DL to big data is the complexity of its decision procedures (Lutz, 2002). Although problems in DL are mostly decidable, the complexity of the associated decision procedures is often too high to be considered scalable: decision problems are usually PSPACE-complete (Baader et al., 2008), which in practice means intractability, especially when the input is big data. There are certain restrictions, such as Horn-DL (Krötzsch et al., 2013), in which some decision problems are easier, namely in P, which makes such restrictions an interesting option for reasoning with big data.

However, the drawback of easier decision procedures may be that the information discovered by such a procedure is of little value. Worthwhile knowledge or information is usually difficult to discover; therefore, a decision procedure that employs search to a certain extent is needed.

A possible way to tackle the difficulty of deciding in DL is to investigate the application of modern SAT solvers (Eén and Sörensson, 2004), (van Maaren and Franco, 2013), which are known for their efficiency in searching for a solution, in their case a valuation of propositional variables that satisfies the given propositional formula. To make SAT solvers applicable to knowledge discovery in big data, an encoding of the associated decision problems as propositional satisfiability is needed. Some advances in modeling decision problems in DL as SAT have already been made (Sebastiani and Vescovi, 2009). This recent progress is focused on modeling classical DL queries in propositional satisfiability. A promising research direction is to find how to enrich this approach with the aspect of large amounts of data translated into propositional statements or facts. It is known that state-of-the-art SAT solvers can find satisfying valuations of formulae containing up to millions of variables; such formulae often appear in hardware verification. Big data may become another domain where SAT solvers are applied successfully, as such data are expected to contain regular patterns, similarly to hardware verification formulae.
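For readers unfamiliar with the search performed by SAT solvers, the following sketch shows a textbook DPLL procedure with unit propagation. It is far simpler than the solvers cited above and is not meant to scale; the clause encoding (lists of signed integers, as in the DIMACS convention) and the example formula are assumptions made for illustration.

```python
def dpll(clauses, assignment=None):
    """Return a satisfying assignment {variable: bool}, or None if none exists.

    A clause is a list of non-zero integers: k stands for variable k,
    -k for its negation. Variables missing from the result are unconstrained.
    """
    assignment = dict(assignment or {})
    simplified = []
    for clause in clauses:
        remaining = []
        satisfied = False
        for lit in clause:
            value = assignment.get(abs(lit))
            if value is None:
                remaining.append(lit)      # literal still undecided
            elif value == (lit > 0):
                satisfied = True           # clause already true
                break
        if satisfied:
            continue
        if not remaining:
            return None                    # every literal false: conflict
        simplified.append(remaining)
    if not simplified:
        return assignment                  # all clauses satisfied
    # Unit propagation: a one-literal clause forces the value of its variable.
    for clause in simplified:
        if len(clause) == 1:
            lit = clause[0]
            return dpll(simplified, {**assignment, abs(lit): lit > 0})
    # Otherwise branch on the first undecided literal, trying "true" first.
    lit = simplified[0][0]
    for value in (lit > 0, lit < 0):
        result = dpll(simplified, {**assignment, abs(lit): value})
        if result is not None:
            return result
    return None


# Hypothetical formula: (x1 or x2) and (x1 -> x3) and (x2 -> x3) and (x3 -> x4).
CLAUSES = [[1, 2], [-1, 3], [-2, 3], [-3, 4]]
MODEL = dpll(CLAUSES)
```

Facts from a data set would enter such a formula as unit clauses, and rules as implications; the open question raised in the text is how to keep this translation manageable for very large sets of facts.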

As mentioned above, data collections may contain inaccuracies and inconsistencies, which may compromise the application of crisp reasoning methods like SAT, which distinguishes only two cases, satisfiable and unsatisfiable, with nothing in between. The situation is different in MaxSAT (Argelich et al., 2008), (Battiti and Protasi, 1998), where the task is to satisfy the maximum number of clauses of the given propositional formula. This kind of optimization is worth considering for modeling problems in knowledge discovery. For example, finding a maximally consistent subset of statements in a big data set is a viable candidate for such modeling.
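The optimization view can be illustrated on a tiny inconsistent set of statements. The sketch below uses exhaustive search, which is exponential in the number of variables and only shows the problem statement; practical MaxSAT solvers work differently. The clause encoding (signed integers) and the example statements are assumptions.

```python
from itertools import product


def max_sat(clauses, n_vars):
    """Exhaustively find an assignment that satisfies the most clauses.

    Returns (number of satisfied clauses, assignment as a tuple of bools),
    where position k - 1 of the tuple holds the value of variable k.
    """
    best_count, best_model = -1, None
    for bits in product((False, True), repeat=n_vars):
        count = sum(
            any(bits[abs(lit) - 1] == (lit > 0) for lit in clause)
            for clause in clauses
        )
        if count > best_count:
            best_count, best_model = count, bits
    return best_count, best_model


# Hypothetical statements: x1, (x1 -> x2), not x2, x2. Together inconsistent.
STATEMENTS = [[1], [-1, 2], [-2], [2]]
COUNT, MODEL = max_sat(STATEMENTS, 2)

# The statements satisfied by the best assignment form a consistent subset
# of maximum size.
CONSISTENT = [
    clause for clause in STATEMENTS
    if any(MODEL[abs(lit) - 1] == (lit > 0) for lit in clause)
]
```

In this example three of the four statements can hold at once, and the single statement that contradicts the others is the one left out.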

2.3 Visualization and Analysis of Big Data Supported by Graph Theoretical Techniques

Much of the understanding, not only of big data but of data in general, can be bolstered by visualization. The fascinating point about data visualization is that it combines computer graphics and combinatorial problem solving, which represents a nice opportunity for cross-fertilization.

Figure 1: An interpretation of linked data as a chordal

graph. The left graph H is a representation of a small case

of linked data. The right graph is an alternative representa-

tion of links by intersections between chords of the cycle.

One of the aims of the project is to find suitable visualiza-

tions through various types of intersection graphs for big

cases of linked data.

Here, we would like to discuss the combinatorial aspect of data visualization in more detail. In many cases, data has the form of relations (Hitzler & Janowicz, 2013), where a relation says whether a given tuple of objects is related or not. A special kind of relation is the binary relation, which considers ordered pairs of objects. This is the most frequent and the most studied kind of relation. A binary relation can also be understood as a link between the given objects. Therefore, data consisting of such relations are called linked data, and their processing is called linked data analysis (Joshi et al., 2013).

A data set containing binary relations can be abstracted as a directed graph where objects are represented as vertices and binary relations between objects are represented as directed edges (or links; usually depicted as arrows). Having a graph, or a big graph in the case of big data, we immediately face the problem of how to visualize or draw it. The classical problem of drawing a graph in the plane so that edges do not intersect, which gave rise to the definition of planar graphs, for which such a drawing is possible (Di Battista et al., 1998), can serve as a starting point. One can also minimize the number of edge intersections to obtain the best possible drawing of a graph.

[Figure labels: G = (V, E), V = {a, b, ..., g}, E = {1, 2, ..., 9}; H = (E, {{α, β} | α ∈ E, β ∈ E, α ∩ β ≠ ∅})]
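The abstraction of linked data as a graph can be sketched in a few lines. The code below builds a directed graph from a list of links and then derives the graph H whose vertices are the links themselves and whose edges join links that share an endpoint, following the definition H = (E, {{α, β} | α ∈ E, β ∈ E, α ∩ β ≠ ∅}) from the figure labels. The example links and the function names are hypothetical.

```python
from itertools import combinations

# Hypothetical linked data: ordered pairs (source object, target object).
LINKS = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "d")]


def directed_graph(links):
    """Adjacency sets of the directed graph G = (V, E) induced by the links."""
    graph = {}
    for source, target in links:
        graph.setdefault(source, set()).add(target)
        graph.setdefault(target, set())  # make sure sinks appear as vertices
    return graph


def edge_intersection_graph(links):
    """Edges of H: pairs of link indices whose links share an endpoint.

    This naive version compares all pairs of links and is therefore
    quadratic in the number of links.
    """
    return {
        frozenset((i, j))
        for (i, e), (j, f) in combinations(enumerate(links), 2)
        if set(e) & set(f)
    }


G = directed_graph(LINKS)
H_EDGES = edge_intersection_graph(LINKS)
```

The quadratic pairwise comparison is acceptable for a sketch but illustrates why big data calls for more careful algorithms.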

A huge number of results exist in graph visualization. There is even a conference dealing solely with combinatorial aspects of graph visualization (Di Battista et al., 2014). The challenge connected with big data is that the considered graphs are extremely large. Thus, even polynomial-time algorithms that are considered efficient in other areas may be prohibitively slow in the case of big data. Hence, methods that process visualization tasks in linear time should be in focus.

Many efficient (linear-time) visualization techniques for graphs can be found for so-called intersection graphs (Golumbic, 1980). Edges in intersection graphs are defined by intersections between objects such as intervals or chords within a cycle (see Figure 1 for an illustration of a chordal graph). The important feature of intersection graphs is that a certain visualization is captured directly by the definition. The special objects that intersect give the resulting graph special properties. Typically, combinatorial problems that are difficult in general graphs, such as determining the chromatic number or the clique number, are easy in some classes of intersection graphs. It is worth studying whether intersection graphs can be derived from big data and whether this knowledge can be utilized in efficient data visualization.
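The definition of an intersection graph of chords can be made concrete as follows. In this sketch each chord is given by the positions of its two endpoints on the cycle, numbered consecutively, and two chords are adjacent when they cross; intersection graphs of chords of a cycle are often called circle graphs in the literature. The chord data and the function names are hypothetical, and the pairwise test is quadratic rather than linear.

```python
from itertools import combinations


def chords_cross(p, q):
    """True iff two chords with distinct endpoints cross inside the cycle."""
    a, b = sorted(p)
    c, d = sorted(q)
    # The chords cross exactly when their endpoints interleave on the cycle.
    return a < c < b < d or c < a < d < b


def intersection_graph(chords):
    """Adjacency sets of the graph whose vertices are the named chords."""
    graph = {name: set() for name in chords}
    for x, y in combinations(chords, 2):
        if chords_cross(chords[x], chords[y]):
            graph[x].add(y)
            graph[y].add(x)
    return graph


# Hypothetical chords on a cycle with endpoint positions 0..7.
CHORDS = {"a": (0, 4), "b": (1, 6), "c": (2, 3), "d": (5, 7)}
GRAPH = intersection_graph(CHORDS)
```

The drawing of the chords is itself a visualization of the resulting graph, which is the feature of intersection graphs emphasized above.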

Generally, we consider visualization a tool for discovering new hypotheses about the visualized concepts. Data visualization has been applied with non-trivial success to find new ways to optimize solutions of problems in theoretical robotics (Surynek, 2011). Therefore, we expect a lot from visualization in big data analysis.

3 EXPECTED PROGRESS

In this section, we summarize the progress that we expect in the mentioned aspects of big data processing.

We expect to find concepts that allow (big) data sets to be understood through compression. An expected result is, for instance, a variant of a decision diagram that allows a compact representation of a data set from which important features of the data can be extracted (similarly to decision trees). Fundamental properties of the suggested concepts are expected to be described and evaluated by means of theoretical computer science.

Efficient encodings of decision problems from big data into propositional satisfiability are also expected to be found. This is connected with identifying interesting decidable problems in big data. An extension of existing applications of SAT in description logic to big data issues is expected. Again, fundamental properties should be described and evaluated both theoretically and experimentally. Overcoming the crisp reasoning of the SAT paradigm to make it suitable for presumably inaccurate data, possibly by shifting to MaxSAT, would be valuable.

Two types of contributions can be expected regarding data visualization. The first is the development of a collection of supporting software prototypes that enable the observation of big data through innovative visualizations. The supporting role of such software consists in helping to understand what is important in big data, which can reveal promising research directions. The second type of outcome is represented by fundamental combinatorial findings that enable visualization. The discovery of suitable graph drawing techniques is expected.

In our opinion, the ultimate contribution to research in big data would be the discovery of a pattern that structurally distinguishes a big data collection from a small one; that is, a pattern that is not observable on the small scale.

4 CONCLUSIONS

Our goal has been to identify several challenges in big data research from the point of view of theoretical computer science and artificial intelligence.

The paramount problem in big data that we have focused on is knowledge discovery. Several particular problems related to knowledge discovery are identified, and approaches to addressing them are discussed. We identify three challenges and their prospective solutions:

(i) Knowledge discovery through compression of the set of facts is suggested to be approached by using decision diagrams such as BDDs or MDDs.

(ii) Decision problems in big data are suggested to be solved by translating them to description logic. A possible way to tackle the complexity of the associated decision procedures is the application of modern SAT solvers.

(iii) Finally, we see great potential in solving combinatorial problems related to big data visualization when the data are regarded as graphs.

The paper also serves as a brief survey of theoretically oriented works applicable to knowledge discovery.

ACKNOWLEDGEMENTS

This work is supported by the Czech Science Foundation under the contract number GAP103/10/1287.

REFERENCES

Akers, S.: Binary decision diagrams. IEEE Transactions on Computers, Vol. 27 (6), 509-516, IEEE Press, 1978.

Argelich, J., Li, C.-M., Manya, F., Planes, J.: The First and Second Max-SAT Evaluations. Journal on Satisfiability, Vol. 4, 251-278, IOS Press, 2008.

Baader, F., Hladik, J., Peñaloza, R.: Automata can show PSpace results for Description Logics. Information and Computation, Special Issue: LATA 2007, Vol. 206 (9-10), 1045-1056, Elsevier, 2008.

Battiti, R., Protasi, M.: Handbook of Combinatorial Optimization. Kluwer, 1998.

Bienvenu, M., Ortiz, M., Simkus, M.: Conjunctive Regular Path Queries in Lightweight Description Logics. Proceedings of IJCAI 2013, IJCAI/AAAI, 2013.

Di Battista, G., Eades, P., Tamassia, R., Tollis, I. G.: Graph Drawing: Algorithms for the Visualization of Graphs. Prentice-Hall, 1998.

Di Battista, G., Tamassia, R., Tollis, I. G.: International Symposium on Graph Drawing. Web page, http://www.graphdrawing.org/, 2014, [accessed in February 2014].

Eén, N., Sörensson, N.: An Extensible SAT-solver. Proceedings of SAT 2003, LNCS 2919, 502-518, Springer Verlag, 2004.

Golumbic, M. C.: Algorithmic Graph Theory and Perfect Graphs. Academic Press, 1980.

Hitzler, P., Janowicz, K.: Linked Data, Big Data, and the 4th Paradigm. Semantic Web, Vol. 4 (3), 233-235, IOS Press, 2013.

Hitzler, P., Krötzsch, M., Rudolph, S.: Foundations of Semantic Web Technologies. Textbooks in Computing, Chapman and Hall/CRC Press, 2009.

Hodges, W.: Model Theory. Cambridge University Press, 1993.

Huang, S., Li, Q., Hitzler, P.: Reasoning with Inconsistencies in Hybrid MKNF Knowledge Bases. Logic Journal of the IGPL, Vol. 21 (2), 263-290, Oxford University Press, 2013.

Joshi, A., Hitzler, P., Dong, G.: Logical Linked Data Compression. Proceedings of ESWC 2013, LNCS 7882, 170-184, Springer Verlag, 2013.

Knorr, M., Alferes, J. J., Hitzler, P.: Local Closed-World Reasoning with Description Logics under the Well-founded Semantics. Artificial Intelligence, Vol. 175 (9-10), 1528-1554, Elsevier, 2011.

Koren, Y., Bell, R. M., Volinsky, C.: Matrix Factorization Techniques for Recommender Systems. IEEE Computer, Vol. 42 (8), 30-37, IEEE Press, 2009.

Krötzsch, M., Rudolph, S., Hitzler, P.: Complexities of Horn Description Logics. ACM Transactions on Computational Logic, Vol. 14 (1), ACM Press, 2013.

Laney, D.: The Importance of 'Big Data': A Definition. Gartner, 2012.

Lehmann, J., Bader, S., Hitzler, P.: Extracting Reduced Logic Programs from Artificial Neural Networks. Applied Intelligence, Vol. 32 (3), 249-266, Springer Verlag, 2010.

Liberty, E.: Simple and deterministic matrix sketching. Proceedings of KDD 2013, 581-588, ACM Press, 2013.

Lutz, C.: The complexity of Description Logics with concrete domains. PhD Thesis, LuFG Theoretical Computer Science, RWTH Aachen, Germany, 2002.

Maaren, H. van, Franco, J.: The international SAT Competitions. Competition web page, http://www.satcompetition.org/, 2013, [accessed in February 2014].

Miller, D. M., Drechsler, R.: Implementing a multiple-valued decision diagram package. Proceedings of ISMVL 1998, 52-57, IEEE Press, 1998.

Mitchell, T.: Machine Learning. McGraw Hill, 1997.

Pearl, J.: Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, 1988.

Ricci, F., Rokach, L., Shapira, B., Kantor, P. B. (editors): Recommender Systems Handbook. Springer Verlag, 2011.

Rice, M.: A Survey of Static Variable Ordering Heuristics for Efficient BDD/MDD Construction. University of California, 2008.

Rumelhart, D. E., Hinton, G. E., Williams, R. J.: Learning representations by back-propagating errors. Nature, Vol. 323 (6088), 533-536, Nature Publishing, 1986.

Russell, S., Norvig, P.: Artificial Intelligence: A Modern Approach. Prentice Hall, 2009.

Sebastiani, R., Vescovi, M.: Automated Reasoning in Modal and Description Logics via SAT Encoding: the Case Study of K(m)/ALC-Satisfiability. J. AI Res., Vol. 35, 343-389, AAAI Press, 2009.

Surynek, P.: Redundancy Elimination in Highly Parallel Solutions of Motion Coordination Problems. Proceedings of ICTAI 2011, 701-708, IEEE Press, 2011.

Zhang, G. P.: Neural Networks for Classification: A Survey. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, Vol. 30 (4), IEEE Press, 2000.

KEOD2014-InternationalConferenceonKnowledgeEngineeringandOntologyDevelopment

332