Provenance and Formal Methods: The Case of Digital Image Processing

Carlos Sáenz-Adán

Departamento de Matemáticas y Computación, Universidad de La Rioja, Logroño, Spain

1 STAGE OF THE RESEARCH

There is a well-known problem of balancing efficiency and reliability when researchers attempt to combine scientific computation with formal verification of algorithms. Verified programs usually cannot compete in performance with applications in production. Both aspects (efficiency and reliability) are particularly important in bioinformatics applications (for instance, in the context of biomedical image processing).

To address these problems, this research aims to set up an environment in which scientific computation with digital images and formal verification of the corresponding algorithms are combined using provenance techniques and standards. Such an environment should facilitate the reproducibility of the processes and also explain why that processing has been performed.

2 STATE OF THE ART

In the project entitled “Formalisation of Mathematics” (ForMath), included in the 7th European programme, formal proof libraries were developed (using proof assistants such as Coq, Isabelle and ACL2) in several mathematical fields. One of the main aims of this project was the formalization of scientific computing algorithms in order to increase trust in Symbolic Computation and Computer Algebra systems. In particular, the Spanish node of the project (led by the University of La Rioja team) made relevant contributions on homological processing of digital images (Lambán et al., 2014), (Poza et al., 2014) and the PhD thesis (Poza, 2013).

An important characteristic of the project is a real collaboration with a team of biologists studying the synthesis of drugs for alleviating neurodegenerative diseases (such as Alzheimer's), which provides us with real examples for processing digital images. The company SpineUp (SpineUp), led by the researcher Miguel Morales, has posed problems on biomedical images that have been solved by the Computer Science group at the University of La Rioja. These contributions have been presented at the Spanish Neuroscience conference (Mata et al., 2013) (Mata et al., 2011). Some of the algorithms implemented in these developments were formally verified within the ForMath project.

This research line, which joins (scientific) computation and deduction (verification of programs and algorithms), has recently been funded by the Spanish Government (project MTM2014-54151-P).

To understand our interest in these issues, it should be noted that the Computer Science group at the University of La Rioja pursues several lines of research. In this team, researchers from the fields of scientific computing and formal methods in Software Engineering coexist with others who come from research in Information Systems, more specifically from the area of data and knowledge metamodeling, and with others coming from the process and workflow management fields.

Modeling can be related to provenance. This relationship comes from the concept of Occurrence-Oriented system, which has been fruitfully exploited in collaboration with the Noesis research group at the University of Zaragoza, led by Eladio Domínguez (Domínguez et al., 2014). Regarding process management, previous contributions are related to service-oriented architectures (SOA), specifically devoted to ensuring an agreed security policy covering the whole chain, from the service establishment to its consumption (Rodriguez-Priego and García-Izquierdo, 2007).

All these interests converge on the provenance field. This is a huge research topic, growing fast and not yet mature. Several ways of organizing provenance have been considered. First, the provenance of information is considered (data-oriented workflows). It has been studied in the case of formal theorem proving (Ikeda et al., 2013) (which is related to the ForMath project), in the case of execution of programs (Cheney et al., 2011) (also using formal methods (Acar et al., 2013)), and finally in the case of databases (Cheney et al., 2009). Data provenance and workflow provenance may also be distinguished (Buneman and Davidson, 2010). As discussed below, in our case both data provenance (digital images) and the processes involved in dealing with data (image processing) are relevant.

Sáenz-Adán C. Provenance and Formal Methods: The Case of Digital Image Processing. © 2015 SCITEPRESS (Science and Technology Publications, Lda.)

3 RESEARCH PROBLEM

The research line applying formal methods to image processing is proving very successful but has some drawbacks (well known in the international community working on these topics). The main weakness is how to combine efficiency (required in real applications, in particular in bioinformatics) with reliability (increased by verifying programs using proof assistants). Reliability is an important property in scientific computing (in particular, in the area of biomedicine), but it conflicts with the pursuit of efficiency. It is well known that verified programs cannot compete in speed with applications in production at scientific laboratories (see, for example, Poza et al., 2014).

In this research we want to reduce the distance between deduction and final applications. To this end, we propose to use information system techniques, in particular modern techniques coming from provenance.

4 EXPECTED OUTCOME

Our proposal, which to the best of our knowledge is new in the literature, consists of including in a single provenance network both causal chains (in other words, those that produce the various transformations of digital images) and chains of arguments (those that explain why a certain process has been implemented). This would not only facilitate the reproducibility of experiments (one of the explicit aims of provenance), but also reproduce the reasons why the different workflow steps have been designed, making it easier for an external agent to understand the process as a whole.

In the most basic case, a conceptual provenance relationship could be a reference to an external document (for example, a technical report, a well-known algorithm or a journal paper) that explains why the associated functional process is consistent (with respect to the workflow objectives). In a more formalized context, the reference may contain a formal proof in a theorem prover (Coq, Isabelle, ACL2, etc.) showing that the algorithm is correct. Finally, in addition to a formal proof, a verified program could be available as an “explanatory artifact”. This certified program (whose correctness is ensured by means of a theorem prover) could be applied to obtain results that can be compared against the outputs obtained in the computing part of the provenance net.
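The three levels of explanation described above can be illustrated with a small data-structure sketch. It is not the representation language this research aims to define; the class names, the field names and the file names are hypothetical and chosen only for illustration. Each processing step (causal chain) carries the explanations (chain of arguments) that justify it, and the net can report the most formal explanation available for a given output.

```python
from dataclasses import dataclass, field
from enum import Enum


class ArtifactKind(Enum):
    """The three levels of explanation described in the text."""
    DOCUMENT = 1          # technical report, well-known algorithm, journal paper
    FORMAL_PROOF = 2      # proof developed in Coq, Isabelle, ACL2, ...
    VERIFIED_PROGRAM = 3  # certified program usable as an explanatory artifact


@dataclass(frozen=True)
class Explanation:
    """Argument node: why a processing step is considered sound."""
    kind: ArtifactKind
    reference: str  # hypothetical locator, e.g. a citation key or a proof file


@dataclass
class ProcessingStep:
    """Causal node: a program turning input images into an output."""
    program: str
    inputs: list
    output: str
    explanations: list = field(default_factory=list)


@dataclass
class ProvenanceNet:
    """Single network holding both causal chains and chains of arguments."""
    steps: list = field(default_factory=list)

    def add(self, step):
        self.steps.append(step)

    def strongest_explanation(self, output):
        """Most formal explanation of the step producing `output`, or None."""
        for step in self.steps:
            if step.output == output and step.explanations:
                return max(step.explanations, key=lambda e: e.kind.value)
        return None


# Hypothetical two-step workflow over made-up file names.
net = ProvenanceNet()
net.add(ProcessingStep(
    "threshold", ["raw.tif"], "mask.tif",
    [Explanation(ArtifactKind.DOCUMENT, "internal report (hypothetical)")]))
net.add(ProcessingStep(
    "count_components", ["mask.tif"], "count.txt",
    [Explanation(ArtifactKind.DOCUMENT, "journal paper (hypothetical)"),
     Explanation(ArtifactKind.VERIFIED_PROGRAM, "certified counter (hypothetical)")]))

print(net.strongest_explanation("count.txt").kind.name)  # VERIFIED_PROGRAM
print(net.strongest_explanation("mask.tif").kind.name)   # DOCUMENT
```

Ordering the kinds by degree of formality is a design choice of this sketch, not something the proposal prescribes.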

In a paradigmatic case, an agent could consult the provenance network to find out which programs were applied to a certain image (including querying the previous images used as parameters, if any), could request the recalculation (with the selected software platform) and, in addition (this is the most innovative aspect of our proposal), could run automated tests against a verified program to show that the production program is appropriate in that particular case. Since the verified program can be (and, in general, will be) less efficient than the programs in production, this testing can be performed off-line, so as not to harm the agility of the reproduction of the functional part of the experiment.
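The first kind of consultation mentioned above (which programs were applied to obtain a certain image, including the previous images used as parameters) amounts to a backward traversal of the causal chain. The following is a minimal sketch under stated assumptions: the provenance net is reduced to a dictionary mapping each derived image to the program that produced it and to its inputs, the network is acyclic, and all program and file names are hypothetical.

```python
def lineage(image, produced_by):
    """Return the (program, output) steps needed to rebuild `image`, inputs first."""
    ordered, seen = [], set()

    def visit(name):
        if name in seen or name not in produced_by:
            return  # already handled, or a raw input with no recorded producer
        seen.add(name)
        program, inputs = produced_by[name]
        for parent in inputs:
            visit(parent)  # earlier steps are listed before the steps using them
        ordered.append((program, name))

    visit(image)
    return ordered


# Hypothetical causal chain: derived image -> (program, input images).
PRODUCED_BY = {
    "denoised.tif": ("denoise", ["raw.tif"]),
    "mask.tif": ("threshold", ["denoised.tif"]),
    "count.txt": ("count_components", ["mask.tif", "denoised.tif"]),
}

for program, output in lineage("count.txt", PRODUCED_BY):
    print(program, "->", output)
# denoise -> denoised.tif
# threshold -> mask.tif
# count_components -> count.txt
```

Because the steps come out in dependency order, the same list could drive the recalculation requested by the agent.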

It should be noted that our proposal avoids the problem (intractable with the current state of the technology) of formally verifying scientific computing algorithms in production, since we do not seek to ensure that the operational program is equivalent, for all inputs, to the verified program. Nevertheless, it allows a flexible and decoupled integration of formal methods that would increase trust in the experiment as a whole.
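The per-case comparison described above (checking the production program against the verified one on the particular input at hand, rather than proving equivalence for all inputs) can be sketched as follows. Both functions are stand-ins written for this illustration: neither is formally verified, and neither comes from the ForMath developments. They count the 4-connected components of a binary image, a quantity chosen only because it is a simple homological invariant; `reference_count` plays the role of the slower verified program and `production_count` that of the program in production.

```python
from collections import deque


def reference_count(image):
    """Stand-in for the verified program: breadth-first flood fill."""
    rows = len(image)
    cols = len(image[0]) if rows else 0
    seen = set()
    components = 0
    for r in range(rows):
        for c in range(cols):
            if not image[r][c] or (r, c) in seen:
                continue
            components += 1
            seen.add((r, c))
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and image[ny][nx] and (ny, nx) not in seen:
                        seen.add((ny, nx))
                        queue.append((ny, nx))
    return components


def production_count(image):
    """Stand-in for the production program: union-find over foreground pixels."""
    rows = len(image)
    cols = len(image[0]) if rows else 0
    parent = {}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path halving
            p = parent[p]
        return p

    for r in range(rows):
        for c in range(cols):
            if not image[r][c]:
                continue
            parent[(r, c)] = (r, c)
            for neighbour in ((r - 1, c), (r, c - 1)):
                if neighbour in parent:  # foreground pixel already visited
                    parent[find(neighbour)] = find((r, c))
    return len({find(p) for p in parent})


def check_case(image):
    """Compare both programs on one input; the record could be stored in the net."""
    expected = reference_count(image)
    actual = production_count(image)
    return {"agrees": expected == actual, "reference": expected, "production": actual}


IMAGE = [
    [1, 1, 0, 0],
    [0, 0, 0, 1],
    [1, 0, 1, 1],
]
print(check_case(IMAGE))  # {'agrees': True, 'reference': 3, 'production': 3}
```

Since `check_case` only needs the stored input, it can be run off-line, after the production result has already been delivered.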

5 OUTLINE OF OBJECTIVES

Based on the problems identified above, we have defined the following four objectives:

• Objective A. Definition of a setting where deduction and processing provenance coexist.
• Objective B. Devising a representation language which integrates (scientific) computing and deduction (verified algorithms and programs).
• Objective C. Setting up a query language over the previous representation.
• Objective D. Developing a proof of concept. We will use the particular case of homological processing of images, with a prototype that will provide a provenance net. It will include both existing formal proofs and ex-novo proofs.

6 METHODOLOGY

The proposed project is multidisciplinary; hence, from a methodological point of view, we will use techniques and methods from different scientific fields. It will be necessary to use the most appropriate one in each stage of development.

ICSOFT 2015 - Doctoral Consortium

For instance, basic literature search techniques can be combined with more advanced techniques such as systematic review (Kitchenham et al., 2002). When tasks related to mathematics are addressed, more formal methods have to be used, possibly combined with methods from mechanized theorem proving. Since the project will likely require software development and systems integration, it will also be necessary to apply methods from the design of information systems and software engineering, such as requirements analysis and conceptual modeling.

6.1 Stages

Each stage corresponds to one objective and has been split into several sub-stages.

• E1. Define a contextual environment with the aim of integrating deduction and computing provenance.
• E1.1. Study previous work in the research group, related both to formal verification of algorithms and to information systems.
• E1.2. Systematic review of the literature related to provenance.
• E1.3. Study the expressiveness of different proposals in the literature, trying to adapt some of them (or a mixture of several) to achieve our objectives.
• E1.4. Set up a semi-formal definition of a provenance model which allows integrating the workflow of a process from a functional perspective, together with explanations describing why the process has been carried out in that way.

• E2. Set up a formal definition of a language which represents models corresponding to the previous stage.
• E2.1. Study different languages in the literature for representing provenance networks (at least PML (Del Rio et al., 2010), OPM (Moreau et al., 2011) and W3C PROV (Missier et al., 2013)).
• E2.2. Propose a representation language with the aim of addressing the second objective.

• E3. Define a query and definition language for networks constructed with the previous representation language.
• E3.1. Analyze the available tools (in particular those developed by the group, such as RCM (Rodriguez-Priego et al., 2013)) for managing data and processes.
• E3.2. Formal definition of a query language for provenance networks.

• E4. Development of a prototype which will use the proposals and definitions mentioned above.
• E4.1. Deployment, in a particular network, of some of the processes already developed for the manipulation of biomedical images, including formal proofs with Isabelle/HOL, Coq or ACL2.
• E4.2. Development of new features with formal proofs.
• E4.3. Justifying that the prototype can also integrate new sources of data and arguments developed in the previous stage.

6.1.1 Schedule

Based on the objectives, a doctoral plan has been drawn up. It has been divided into four stages, each one corresponding to one year.

Throughout the PhD plan there are, in addition, coordination and supervision meetings with the thesis advisors and other members of the research group. Participation in training courses and in conferences, to present the partial results obtained, is also foreseen.

Furthermore, there will be tasks related to the generation of documentation (internal reports, journal and proceedings papers) and to the development of programs and formal proofs.

REFERENCES

Acar, U. A., Ahmed, A., Cheney, J., and Perera, R. (2013). A core calculus for provenance. Journal of Computer Security, 21(6):919–969.

Buneman, P. and Davidson, S. B. (2010). Data provenance–the foundation of data quality.

Cheney, J., Ahmed, A., and Acar, U. A. (2011). Provenance as dependency analysis. Mathematical Structures in Computer Science, 21(6):1301–1337.

Cheney, J., Chiticariu, L., and Tan, W.-C. (2009). Provenance in databases: Why, how, and where, volume 4. Now Publishers Inc.

Del Rio, N., da Silva, P. P., and Porras, H. (2010). Browsing Proof Markup Language provenance: Enhancing the experience. In Provenance and Annotation of Data and Processes, pages 274–276. Springer.

Domínguez, E., Pérez, B., Rubio, A. L., Zapata, M. A., Lavilla, J., and Allué, A. (2014). Occurrence-oriented design strategy for developing business process monitoring systems. Knowledge and Data Engineering, IEEE Transactions on, 26(7):1749–1762.

ProvenanceandFormalMethods:TheCaseofDigitalImageProcessing

ForMath. http://wiki.portal.chalmers.se/cse/pmwiki.php/ForMath/ForMath.

Ikeda, R., Das Sarma, A., and Widom, J. (2013). Logical provenance in data-oriented workflows? In Data Engineering (ICDE), 2013 IEEE 29th International Conference on, pages 877–888. IEEE.

Kitchenham, B. A., Pfleeger, S. L., Pickard, L. M., Jones, P. W., Hoaglin, D. C., El Emam, K., and Rosenberg, J. (2002). Preliminary guidelines for empirical research in software engineering. Software Engineering, IEEE Transactions on, 28(8):721–734.

Lambán, L., Rubio, J., Martín-Mateos, F.-J., and Ruiz-Reina, J.-L. (2014). Verifying the bridge between simplicial topology and algebra: the Eilenberg–Zilber algorithm. Logic Journal of IGPL, 22(1):39–65.

Mata, G., Cuesto, G., Morales, M., Rubio, J., and Heras, J. (2011). SynapCountJ: un software para el estudio de la densidad sináptica. In XIV Congreso de la Sociedad Española de Neurociencia (SENC 2011). http://www.senc2011.com/docs/programa senc2011.pdf.

Mata, G., Fernández, P., Romero, A., Rubio, J., Cuesto, G., and Morales, M. (2013). NucleusJ: desarrollo de un plugin en Fiji para el análisis de modelos de muerte neuronal. In XV Congreso de la Sociedad Española de Neurociencia (SENC 201). http://www.senc2013.com/.

Missier, P., Belhajjame, K., and Cheney, J. (2013). The W3C PROV family of specifications for modelling provenance metadata. In Proceedings of the 16th International Conference on Extending Database Technology, pages 773–776. ACM.

Moreau, L., Clifford, B., Freire, J., Futrelle, J., Gil, Y., Groth, P., Kwasnikowska, N., Miles, S., Missier, P., Myers, J., et al. (2011). The open provenance model core specification (v1.1). Future Generation Computer Systems, 27(6):743–756.

Poza, M. (2013). Certifying homological algorithms to study biomedical images. PhD thesis, Universidad de La Rioja.

Poza, M., Domínguez, C., Heras, J., and Rubio, J. (2014). A certified reduction strategy for homological image processing. ACM Transactions on Computational Logic (TOCL), 15(3):23.

Rodriguez-Priego, E. and García-Izquierdo, F. J. (2007). Securing code in services oriented architecture. In Web Engineering, pages 550–555. Springer.

Rodriguez-Priego, E., García-Izquierdo, F. J., and Rubio, A. L. (2013). References-enriched concept map: a tool for collecting and comparing disparate definitions appearing in multiple references. Journal of Information Science, page 0165551513487848.

SpineUp. http://spineup.es.

ICSOFT2015-DoctoralConsortium