IMPLEMENTATION OF RESTRICTION AND VALIDATION RULES

Olegas Vasilecas and Evaldas Lebedys

Vilnius Gediminas Technical University, Sauletekio av. 11, LT-10223 Vilnius, Lithuania

Keywords: Data quality, Data validation, Restriction rules, Validation rules.

Abstract: The paper discusses the implementation of restriction and validation rules. Data cleaning approaches based on restriction rules and on validation rules are discussed, and the differences between the two kinds of rules are identified. The paper presents a method for the automated implementation of data validation rules in data cleaning procedures.

1 INTRODUCTION

At the moment, a variety of methods and commercial tools are available that can be used to model business systems and implement data integrity constraints through the functionality of active database management systems. Unfortunately, these tools do not support data validation: the implementation of business rules as integrity constraints, triggers and stored procedures is used only to prevent the entry of erroneous data into the database. Regardless of the use of data quality checks at the entry of data into the database, errors in data exist. The application of the business rules approach to data quality assurance has been widely discussed in the publications of recent years. Currently, only domain-specific data management tools support data validation, but these tools either do not support system modelling at all or are suitable for modelling only some aspects of a system. Therefore, there are no tools that support both system modelling and data validation. Errors in software, unintended access to data and other factors may be sources of errors in data. These circumstances may be crucial for the quality of data in certain domains, such as statistical data processing, clinical trials or telecommunications. Besides, in certain domains even erroneous data may not be changed or rejected at entry into the database.

Section 1 introduces the paper. Section 2 briefly presents restriction rules and validation rules. The purpose of restriction and validation rules is analysed in section 3. The quality of data in the context of clinical trials is discussed in section 4. Section 5 presents an automated approach based on validation rules. Section 6 concludes the paper.

2 RELATED LITERATURE

The importance and criticality of data quality (DQ) is widely discussed in recent publications (Bertolazzi, 2001), (Motro, 1996). Many external and internal factors affect the quality of data. Constraints are implemented in most information systems, but errors in data still occur. The full set of desirable data characteristics is analysed in the context of data quality. Although “fitness for use” is the most widely adopted concept of data quality (Strong, 1997), DQ is defined using various terms in different application domains (Bertolazzi, 2001), (Ehlers, 2005), (Jilovsky, 2005), (Wang, 1996). Data quality can also be defined as a lack of intolerable defects in data (McKnight, 2004). Different levels of data quality assurance are implemented in most systems, especially those in which data quality is of high importance.

At least two important procedures are implemented (Galhardas, 2001):
• Restriction of entry of ineligible data (Figure 1);
• Cleaning of collected data (Figure 2).

The work is supported by Lithuanian State Science and

Studies Foundation according to High Technology

Development Program Project "VeTIS" (Reg.No. B-

07042)


Vasilecas O. and Lebedys E. (2009).

IMPLEMENTATION OF RESTRICTION AND VALIDATION RULES.

In Proceedings of the International Conference on Health Informatics, pages 230-236

DOI: 10.5220/0001542202300236

© SciTePress

Figure 1: Data processing using restriction rules. (Diagram elements: data collection, paper documents, data entry, data examination, error reporting, data correction, database, data representation, user.)

Figure 2: Data processing using validation rules. (Diagram elements: data collection, paper documents, data entry, database, data cleaning, validation error reporting, data correction, data representation, user.)

Collected data may become invalid even if restriction procedures are implemented, so constraints prohibiting the entry of erroneous data are not enough. Data cleaning consists of three steps: auditing data to find violations, choosing transformations to fix the violations, and applying the transformations to the datasets (Vijayshankar, 2001). Depending on the application domain, all steps of data cleaning may vary: different methods of data auditing and different methods of data transformation may be used. In most cases, rules for data auditing have to be defined. These rules are used to validate data and identify discrepancies. Data transformation is the next step; it is based either on automated correction of discrepancies or on changes in data that require user interaction.
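The three data cleaning steps can be illustrated with a minimal Python sketch. It is not the implementation used in this work: the names AuditRule, audit and clean, the record fields and the two example rules are assumptions made only for illustration. A rule with a fix is corrected automatically, while a rule without one is reported for user interaction.

```python
from dataclasses import dataclass
from typing import Callable, Optional

Record = dict  # one data record, e.g. a row of a case report form


@dataclass
class AuditRule:
    name: str
    is_valid: Callable[[Record], bool]
    # Automated transformation; None means the discrepancy needs user interaction.
    fix: Optional[Callable[[Record], Record]] = None


def audit(records, rules):
    """Step 1: audit the data and return (record index, rule) for each violation."""
    return [(i, rule)
            for i, record in enumerate(records)
            for rule in rules
            if not rule.is_valid(record)]


def clean(records, rules):
    """Steps 2 and 3: choose a transformation per violation and apply it."""
    cleaned = [dict(record) for record in records]  # leave the input untouched
    needs_review = []
    for i, rule in audit(cleaned, rules):
        if rule.fix is not None:
            cleaned[i] = rule.fix(cleaned[i])
        else:
            needs_review.append((i, rule.name))
    return cleaned, needs_review


# Hypothetical rules and records, for illustration only.
rules = [
    AuditRule("sex_code", lambda r: r["sex"] in {"F", "M"},
              fix=lambda r: {**r, "sex": r["sex"].upper()}),
    AuditRule("weight_positive", lambda r: r["weight_kg"] > 0),
]
records = [
    {"patient_id": 1, "sex": "f", "weight_kg": 62.0},
    {"patient_id": 2, "sex": "M", "weight_kg": -5.0},
]
cleaned, needs_review = clean(records, rules)
print(cleaned[0]["sex"], needs_review)
```

In this sketch the lower-case sex code is corrected automatically, whereas the negative weight is only reported, because no safe automated correction is assumed to exist for it.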

The implementation of both procedures, restriction of entry of ineligible data and cleaning of collected data, is based on rules (Galhardas, 2001). These rules may be semantically identical, but the intent of use and the way of implementation are different. Depending on the intent of use and the way of implementation, these rules are classified into:
• Restriction rules;
• Validation rules.

Restriction rules are also called constraints in software systems. Constraints have been used in software systems to avoid errors in data for many years. They are effective in preventing user errors and are still widely used. Although constraints are implemented in most information systems, errors in data still occur. Constraints are not suitable for eliminating errors that occur due to erroneous program code, security problems and unexpected access to the database, inaccurate mass data updates, or inadequate data representation (Akiyama, 2005), (Ballou, 2006). A variety of methods and commercial tools are available that can be used to model business systems and implement data integrity constraints through the functionality of active database management systems. Validation rules are used to examine collected data with the intent to identify data units that do not comply with defined requirements.

The following active database management system (ADBMS) features are used to implement restriction rules:
• Data types;
• Not null constraints;
• Lists and ranges of values;
• Multiplicity of relationships between tables;
• Primary and foreign keys;
• Referential integrity constraints;
• Stored procedures;
• Triggers.
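Several of the listed features can be sketched with Python's standard sqlite3 module. SQLite is used here only to keep the example self-contained; it is an assumption of this sketch, it has no stored procedures, and its type checking is looser than that of most ADBMS. The table names, column names and value ranges are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only on request
conn.executescript("""
CREATE TABLE patient (
    patient_id INTEGER PRIMARY KEY,                        -- primary key
    sex        TEXT NOT NULL CHECK (sex IN ('F', 'M'))     -- not null, list of values
);
CREATE TABLE visit (
    visit_id    INTEGER PRIMARY KEY,
    patient_id  INTEGER NOT NULL
                REFERENCES patient(patient_id),            -- referential integrity
    systolic_bp INTEGER CHECK (systolic_bp BETWEEN 50 AND 300)  -- range of values
);
CREATE TRIGGER visit_no_delete BEFORE DELETE ON visit      -- trigger
BEGIN
    SELECT RAISE(ABORT, 'visit records may not be deleted');
END;
""")


def accepted(sql):
    """Return True if the database accepts the statement, False if a rule rejects it."""
    try:
        conn.execute(sql)
        return True
    except sqlite3.DatabaseError:
        return False


outcomes = [
    accepted("INSERT INTO patient VALUES (1, 'F')"),     # valid record
    accepted("INSERT INTO patient VALUES (2, 'X')"),     # value outside the list
    accepted("INSERT INTO patient VALUES (3, NULL)"),    # missing mandatory value
    accepted("INSERT INTO visit VALUES (1, 1, 120)"),    # valid record
    accepted("INSERT INTO visit VALUES (2, 99, 120)"),   # unknown patient
    accepted("INSERT INTO visit VALUES (3, 1, 900)"),    # value outside the range
    accepted("DELETE FROM visit WHERE visit_id = 1"),    # blocked by the trigger
]
print(outcomes)
```

Every rule in the sketch reacts at the moment of data manipulation: an ineligible statement is rejected and the stored data stay unchanged.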

Validation rules can be implemented in an ADBMS using stored procedures. All the other database management system features are real-time based and are processed at the time of data manipulation: data entry, data update or data deletion. As validation rules are executed manually and do not react to real-time data changes, these ADBMS features are not suitable for implementing validation rules. Rule repositories are also often used in parallel with stored procedures for data validation.


Table 1: The use of ADBMS features to assure data quality.

Data quality category                 ADBMS feature                                 Restriction rules  Validation rules
Integrity                             Data types                                    yes                no
Completeness                          Not null constraints                          yes                no
Consistency, Integrity                Lists and ranges of values                    yes                no
Completeness                          Multiplicity of relationships between tables  yes                no
Consistency, Integrity                Primary and foreign keys                      yes                no
Consistency, Integrity                Referential integrity constraints             yes                no
Completeness, Consistency, Integrity  Stored procedures                             yes                yes
Completeness, Consistency, Integrity  Triggers                                      yes                no

3 PURPOSE OF RESTRICTION RULES AND DATA VALIDATION RULES

We distinguish between data processing based on restriction rules and data processing based on validation rules. Restriction rules are implemented in almost all systems, and even systems based on data validation use some restriction rules. Restriction rules and validation rules are mostly used to ensure the quality of data with respect to the following data quality categories (Strong, 1997), (Wang, 1996):
• Completeness;
• Consistency;
• Integrity.

Missing values and data inconsistencies have to be identified to make data clean. Different types of inconsistencies have to be considered when designing both restriction and validation rules:
• Data inconsistency within one record;
• Data inconsistency within one table;
• Data inconsistency within one database;
• Data inconsistency across different data sources.
Different types of inconsistencies influence different data quality categories, and different tools are used to clean these inconsistencies (Table 1).
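The first three types of inconsistency can be illustrated with validation queries that are run on demand against data already stored. The sketch below uses Python's sqlite3 module; the tables, columns, sample rows and query names are hypothetical. Inconsistencies across different data sources would require access to a second source and are not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# No constraints are declared, so erroneous data can be stored and must be found later.
conn.executescript("""
CREATE TABLE patient (patient_id INTEGER, birth_date TEXT, consent_date TEXT);
CREATE TABLE visit (patient_id INTEGER, visit_no INTEGER, visit_date TEXT);
INSERT INTO patient VALUES (1, '1970-03-01', '2008-05-10'),
                           (2, '1985-07-20', '1980-01-01');
INSERT INTO visit VALUES (1, 1, '2008-05-12'),
                         (1, 1, '2008-05-19'),
                         (3, 1, '2008-06-02');
""")

# One validation query per type of inconsistency (ISO dates compare correctly as text).
VALIDATION_QUERIES = {
    "record: consent before birth":
        "SELECT patient_id FROM patient WHERE consent_date < birth_date",
    "table: duplicate visit number":
        "SELECT patient_id FROM visit "
        "GROUP BY patient_id, visit_no HAVING COUNT(*) > 1",
    "database: visit without patient":
        "SELECT v.patient_id FROM visit v "
        "LEFT JOIN patient p ON p.patient_id = v.patient_id "
        "WHERE p.patient_id IS NULL",
}


def run_validation(connection):
    """Execute every validation query and collect the offending patient identifiers."""
    return {name: [row[0] for row in connection.execute(sql)]
            for name, sql in VALIDATION_QUERIES.items()}


report = run_validation(conn)
for name, patient_ids in report.items():
    print(name, patient_ids)
```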

As already mentioned, restriction rules are triggered each time data are inserted, updated or deleted. Restriction rules are based on an automatic reaction to changes in the data state. Validation rules are usually executed manually or at scheduled time points, and they never react to data changes at the time the data are changed. Thus, the means suitable for implementing restriction rules and validation rules differ.

This paper focuses on data processing based on validation rules, and only the implementation of validation rules is discussed further.

4 DATA QUALITY IN CLINICAL TRIALS

It is obvious that the means for data collection and analysis in clinical trials have to be precise, of high quality, verified and validated. These requirements also apply to the applications used in clinical trials for data interchange, data entry, data clarification, data record tracking, etc. The lack of a system for gathering and managing all the requirements for a particular trial complicates quality control. Requirements for a particular clinical trial may be expressed as business rules. The use of the principles of the business rules approach may facilitate quality control. This is especially applicable to clinical trial applications. Trial-related knowledge and know-how stored in a business rules repository give a broad view of the whole set of requirements. The question of how to gather all the rules into one repository arises here. We propose to use the model of a clinical trial to gather all the rules into the rules repository. The modelling of a clinical trial may slightly prolong clinical trial design and may require additional resources, but it is definitely advantageous. First of all,

HEALTHINF 2009 - International Conference on Health Informatics


a graphical model of a clinical trial gives a broad view of the organisation and the procedures of the trial. Besides, a clinical trial model may be suitable for capturing trial-related rules.

There is no general specification of all requirements for a clinical trial, which complicates quality control. In most cases, a model of the clinical trial is not created during the design of the trial. The result of clinical trial design is a clinical trial protocol. The protocol presents all the information needed for the conduct of the clinical trial, but the information is represented in natural language. The use of natural language for clinical trial description has both positive and negative aspects:
• the positive aspect of the use of natural language is the clarity of the protocol for everyone interested in the clinical trial. In other words, the protocol is understandable, easily readable and does not require any special knowledge;
• the negative aspect of the use of natural language is its ambiguity. Natural language is informal and can be interpreted in different ways. As the clinical trial protocol is the primary document for the conduct of a clinical trial, it is desirable to have an unambiguous specification of all trial procedures.

The use of a formal or semi-formal modelling language for clinical trial modelling may help reduce the ambiguity of the protocol. However, as there are special requirements for the clinical trial protocol and it has to be approved before the start of the trial, it is impossible to present a model of the trial to the responsible authorities instead of the protocol. Thus a model of a clinical trial cannot replace the protocol. Instead, a model of the trial may be prepared in parallel with the construction of the protocol. It would be even better to start the design of the trial from the model, but this may be impossible, because the design of the model may prolong the design of the study. Therefore the trial should be modelled using a formal or semi-formal language to represent the procedures of the trial graphically, either in parallel with the design of the protocol or just after the protocol is created.

There are many modelling languages suitable for representing different aspects of systems: UML, IDEF, conceptual graphs, etc. (Vasilecas, 2005). As UML has become the most popular modelling language for any kind of system in recent years, we analyse the use of UML for clinical trial modelling in this paper. The Unified Modelling Language is a visual language for specifying, constructing and documenting the artefacts of systems. It is a general-purpose modelling language that can be used with all major object and component methods, and that can be applied to all application domains (e.g., health, finance, telecom, aerospace) (OMG, 2006). UML diagrams can be classified into three different classes (Shen, 2002):
• diagrams describing the roles and obligations of system users in general (Use Case diagrams). In clinical trial models these diagrams should represent the roles and obligations of the clinical trial team members and participants. For example, the right to revoke the patient's informed consent or the obligation of the investigator to record medical history in the Case Report Form can be represented in UML Use Case diagrams;
• diagrams describing structural system aspects (class and object diagrams). In the clinical trial model, class diagrams should be used to represent the organisation of a trial in detail. For example, each examination, visit, laboratory assessment, etc., should be represented as a class with attributes and operations. The class model may be used to create the structure of the database for the clinical trial data;
• diagrams describing the internal and external behaviour of the system (state transition diagrams, sequence and collaboration diagrams). In clinical trial models these diagrams should be used to represent the sequence of actions in each step of a clinical trial. For example, the procedure of a screening visit can be described in sequence or collaboration diagrams, and the states of the patient diary can be represented in state transition diagrams.
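The remark that a class model may be used to create the database structure can be sketched as follows. The ModelClass and Attribute types, the create_table_sql function and the screening visit class are hypothetical; a real class diagram would also carry operations and relationships, which this sketch ignores.

```python
import sqlite3
from dataclasses import dataclass, field


@dataclass
class Attribute:
    name: str
    sql_type: str
    mandatory: bool = False


@dataclass
class ModelClass:
    name: str
    attributes: list = field(default_factory=list)


def create_table_sql(model_class):
    """Map one class of the model to a CREATE TABLE statement, one column per attribute."""
    columns = []
    for attribute in model_class.attributes:
        column = f"{attribute.name} {attribute.sql_type}"
        if attribute.mandatory:
            column += " NOT NULL"
        columns.append(column)
    return f"CREATE TABLE {model_class.name} (" + ", ".join(columns) + ")"


# Hypothetical class taken from a clinical trial model.
screening_visit = ModelClass("screening_visit", [
    Attribute("patient_id", "INTEGER", mandatory=True),
    Attribute("visit_date", "TEXT", mandatory=True),
    Attribute("weight_kg", "REAL"),
])

ddl = create_table_sql(screening_visit)
print(ddl)

# The generated statement is executed to confirm that it is valid SQL.
conn = sqlite3.connect(":memory:")
conn.execute(ddl)
column_names = [row[1] for row in conn.execute("PRAGMA table_info(screening_visit)")]
```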

UML models are not fully formal. Some information represented in UML diagrams is open to interpretation, but generally UML models are suitable for the automation of systems development. We assume that the use of UML would greatly improve clinical trial design and data validation processes. We highlight the following main advantages of using UML for clinical trial research:
• A UML model would give a broad graphical view of the whole trial. This would improve quality control and documentation of clinical trial procedures;
• Duties and responsibilities of clinical trial team members represented in UML Use Case models would simplify the preparation of operational manuals for investigators and other team members;


Figure 3: Common data cleaning process. (Diagram elements: document analysis, requirement identification, requirement specification, implementation of requirements, program code / SQL queries, data validation, production of error messages, error messages, data correction / error reduction, data cleaning.)

• The organisation of clinical trial structural components represented in UML Class diagrams can be used for clinical trial database design;
• Representation of all requirements for valid clinical trial data in one model would give a broad view of all rules for data validation;
• The rules for valid data described in UML models may be retrieved from the UML model and placed into a rules repository (Vasilecas, 2005). This would simplify the extraction of rules from source documents.

Because of the limited space of the paper, clinical trial modelling using UML is not discussed in detail. The novelty of this paper is not the use of UML for modelling a specific domain, clinical trials. The aim is to present a way to automate clinical trial data validation and improve data clarification.

5 IMPLEMENTATION OF VALIDATION RULES

The common data cleaning approach is based on manual identification of system requirements and manual implementation of data validation rules. All related documentation is analysed to identify system requirements. The requirements specification is the basis for manual implementation of the validation rules expressed in it. Data validation is then executed to identify data errors and produce error messages (Figure 3).

It is always desirable to reduce the amount of manual work, because automated activities are more reliable. Previous analysis showed that rules may be derived from system models represented in UML (Vasilecas, 2006). Derived rules can be used for further analysis and automated implementation. The Unified Modelling Language was chosen for analysis because it is a general-purpose modelling language that can be used with all major object and component methods, and that can be applied to all application domains (OMG, 2006). UML diagrams can be classified into three different classes (Shen, 2002):
• diagrams describing the roles and obligations of system users in general (Use Case diagrams);
• diagrams describing structural system aspects (class and object diagrams);
• diagrams describing the internal and external behaviour of the system (state transition diagrams, activity diagrams, sequence and collaboration diagrams).

Business rules in Use Case diagrams mostly appear as statements describing the competence boundaries and obligations of system actors. They are represented by relating system functions to domain actors. Each Use Case can be depicted in detail using sequence and collaboration diagrams. Sequence and collaboration diagrams represent how system actors act and exchange information to execute the tasks they are assigned. They include business rules describing the exact order of actions to be executed to perform a task. State transition diagrams are used to specify the sequences of state changes of business objects. Event-Condition-Action (ECA) rules are mostly represented in state transition diagrams. UML activity diagrams can be used to model the logic of the operations captured by one or several use cases. Activity diagrams represent both the basic sequence of actions and the alternate sequences of actions. ECA rules are also mostly represented in activity diagrams. Class diagrams include rules that express constraints on business objects and properties of


Figure 4: Automated data cleaning process. (Diagram elements: document analysis, system modelling, system model represented using UML, analysis of UML model / rule identification, rule repository, generation of requirement specification, requirement specification, implementation of validation rules, SQL queries, data validation, production of error messages, error messages, data correction / error reduction, data cleaning.)

relationships between business objects. The rest of the UML diagrams are used to represent aspects of the development and implementation of software systems and are not analysed further.
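An Event-Condition-Action rule taken from a state transition diagram can be sketched as a small data structure. The EcaRule type, the dispatch function and the patient diary states and events below are assumptions made for illustration; they are not taken from an actual trial model.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EcaRule:
    event: str                         # Event: what happened
    condition: Callable[[dict], bool]  # Condition: guard evaluated on the business object
    action: Callable[[dict], None]     # Action: executed only if the guard holds


def dispatch(event, business_object, rules):
    """Run the action of every rule whose event matches and whose condition holds."""
    fired = 0
    for rule in rules:
        if rule.event == event and rule.condition(business_object):
            rule.action(business_object)
            fired += 1
    return fired


# Hypothetical transitions of a patient diary: issued -> returned -> reviewed.
diary_rules = [
    EcaRule("diary_returned",
            lambda d: d["state"] == "issued",
            lambda d: d.update(state="returned")),
    EcaRule("diary_reviewed",
            lambda d: d["state"] == "returned",
            lambda d: d.update(state="reviewed")),
]

diary = {"state": "issued"}
skipped = dispatch("diary_reviewed", diary, diary_rules)  # guard fails, state unchanged
applied = dispatch("diary_returned", diary, diary_rules)  # guard holds, state changes
print(skipped, applied, diary["state"])
```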

An automated data cleaning approach based on the derivation of rules from system models represented in UML is presented below (Figure 4). Commercial CASE tools usually store UML model data in XML files. At the current stage of research, only Sybase PowerDesigner was chosen for analysis. The file that PowerDesigner produces is used for the analysis and identification of validation rules. Only the rules that match the rule representation templates are derived from UML models. First of all, the system parses the XML file that stores the UML model specification. The system looks for the following components in the XML file:
• actors, use cases and associations that were represented in Use Case diagrams;
• classes, attributes, operations and relations that were represented in Class diagrams;
• objects and messages that were represented in sequence and collaboration diagrams;
• start points, activities, transitions, decisions and end points that were represented in activity diagrams;
• start points, states, transitions and end points that were represented in state chart diagrams.
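The parsing step can be sketched with Python's standard xml.etree.ElementTree module. The XML layout below is a deliberately simplified, hypothetical format and not the actual PowerDesigner file format; the element and attribute names are assumptions, and only some of the listed component types are extracted.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified model file; a real CASE tool export is far more detailed.
MODEL_XML = """
<model>
  <class name="ScreeningVisit">
    <attribute name="visit_date" type="date" mandatory="true"/>
    <attribute name="weight_kg" type="float" mandatory="false"/>
  </class>
  <actor name="Investigator"/>
  <usecase name="Record medical history"/>
  <association actor="Investigator" usecase="Record medical history"/>
</model>
"""


def extract_components(xml_text):
    """Return (component type, component name) pairs found in the model file."""
    root = ET.fromstring(xml_text.strip())
    components = []
    for cls in root.iter("class"):
        components.append(("class", cls.get("name")))
        for attribute in cls.iter("attribute"):
            components.append(
                ("attribute", f'{cls.get("name")}.{attribute.get("name")}'))
    for tag in ("actor", "usecase"):
        for element in root.iter(tag):
            components.append((tag, element.get("name")))
    for association in root.iter("association"):
        components.append(
            ("association",
             f'{association.get("actor")} -> {association.get("usecase")}'))
    return components


components = extract_components(MODEL_XML)
for kind, name in components:
    print(kind, name)
```

Each extracted pair corresponds to one model component that would then be copied to the rule repository.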

Each model component is copied to the rule repository. Repository data are indexed after all model components have been copied. All UML diagram components are then merged using the available rule templates. The rules in the repository can then be used to generate a natural language requirement specification. SQL queries for data validation are also generated using the rules in the repository. Further steps of data cleaning are similar to restriction rules based data processing. The main difference between these two approaches is the automated generation of data validation queries using the rules derived from system models.
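The generation of a requirement specification and of validation queries from repository rules can be sketched as follows. The Rule type, the two rule kinds ("mandatory" and "range") and the example table are hypothetical templates chosen for illustration; the rule templates of the prototype are not reproduced here.

```python
import sqlite3
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rule:
    table: str
    key: str                 # column that identifies the erroneous record
    column: str
    kind: str                # hypothetical templates: "mandatory" or "range"
    low: Optional[float] = None
    high: Optional[float] = None


def to_text(rule):
    """Generate a natural language requirement from a rule."""
    if rule.kind == "mandatory":
        return f"{rule.column} of {rule.table} must be recorded."
    if rule.kind == "range":
        return (f"{rule.column} of {rule.table} must be between "
                f"{rule.low} and {rule.high}.")
    raise ValueError(f"unknown rule kind: {rule.kind}")


def to_sql(rule):
    """Generate a query selecting the records that violate the rule.

    Identifiers are interpolated directly, so rules must come from a trusted repository.
    """
    if rule.kind == "mandatory":
        condition = f"{rule.column} IS NULL"
    elif rule.kind == "range":
        condition = f"{rule.column} NOT BETWEEN {rule.low} AND {rule.high}"
    else:
        raise ValueError(f"unknown rule kind: {rule.kind}")
    return f"SELECT {rule.key} FROM {rule.table} WHERE {condition}"


repository = [
    Rule("visit", "visit_id", "visit_date", "mandatory"),
    Rule("visit", "visit_id", "weight_kg", "range", low=30, high=250),
]

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE visit (visit_id INTEGER, visit_date TEXT, weight_kg REAL);
INSERT INTO visit VALUES (1, '2008-05-12', 62.0),
                         (2, NULL, 70.5),
                         (3, '2008-06-02', 500.0);
""")

# Error messages: requirement text mapped to the identifiers of violating records.
errors = {to_text(rule): [row[0] for row in conn.execute(to_sql(rule))]
          for rule in repository}
for requirement, visit_ids in errors.items():
    print(requirement, visit_ids)
```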

Further analysis is focused on the manual definition of data validation rules in addition to the rules derived from system models.

6 CONCLUSIONS

Analysis of recent research in the data quality area showed that data quality is relevant for every organisation and, due to its complexity, is a problematic research area. Implementation of restriction rules is not always sufficient, and additional data cleaning procedures have to be implemented to obtain data of high quality. On the basis of previous research we concluded that data quality requirements can be derived from system models represented in UML. We therefore proposed an automated approach based on data validation rules that focuses on the automatic derivation of data quality requirements from system models. The proposed method was implemented in a software prototype and was briefly presented in the paper.


REFERENCES

Akiyama, I., Propheter, S. K. (2005) Methods of Data Quality Control: For Uniform Crime Reporting Programs. Federal Bureau of Investigation. URL: http://www.fbi.gov/hq/cjisd/data_quality_control.pdf.

Ballou, D. P., Chengalur-Smith, I. N., Wang, R. Y. (2006) Sample-Based Quality Estimation of Query Results in Relational Database Environments. IEEE Transactions on Knowledge and Data Engineering, Vol. 18, No. 5, 639-650.

Bertolazzi, P., Scannapieco, M. (2001) Introducing data quality in a cooperative context. In Proceedings of the Sixth International Conference on Information Quality (IQ2001), Boston, MA, USA, 431-444.

Ehlers, U. D., Goertz, L., Hildebrandt, B., Pawlowski, J. M. (2005) Quality in e-learning: Use and dissemination of quality approaches in European e-learning. A study by the European Quality Observatory, Luxembourg: Office for Official Publications of the European Communities. URL: http://www2.trainingvillage.gr/etv/publication/download/panorama/5162_en.pdf.

Galhardas, H., Florescu, D., Shasha, D., Simon, E., Saita, Ch. (2001) Declarative data cleaning: Language, model, and algorithms. In the International Journal on Very Large Data Bases (VLDB), 371-380.

Jilovsky, C. (2005) Data quality – what is it and does it matter? Presented at Information Online 2005 Exhibition & Conference, Sydney. URL: http://conferences.alia.org.au/online2005/papers/c12.pdf.

McKnight, W. (2004) Overall Approach to Data Quality ROI. White Paper, Firstlogic Inc. URL: http://www.oracle.com/technology/products/warehouse/pdf/Overall%20Approach%20to%20Data%20Quality%20ROI.pdf.

Motro, A. and Rakov, I. (1996) Estimating the Quality of Data in Relational Databases. In Proceedings of the 1996 Conference on Information Quality, 94-106.

Object Management Group (OMG). (2006) Unified Modeling Language (UML) Specification: Infrastructure version 2.0. URL: http://www.omg.org/docs/formal/05-07-05.pdf.

Shen, W., Compton, K., Huggins, J. K. (2002) A Toolset for Supporting UML Static and Dynamic Model Checking. In Proceedings of the 26th International Computer Software and Applications Conference (COMPSAC 2002), Prolonging Software Life: Development and Redevelopment, Oxford, England, IEEE Computer Society, 147-152.

Strong, D. M., Lee, Y. W., Wang, R. Y. (1997) Data Quality in Context. Communications of the ACM, Vol. 40, No. 5, 103-110.

Vasilecas, O., Lebedys, E., Laucius, J. (2005) Repository for Business Rules Represented in UML Diagrams. Izvestia of the Belarusian Engineering Academy, V1 No. (19)/2, 187-192.

Vasilecas, O., Lebedys, E. (2006) Moving business rules from system models to business rules repository. INFOCOMP, V5, No 2, 11-17.

Vijayshankar, R. and Hellerstein, J. M. (2001) Potter’s Wheel: An Interactive Data Cleaning System. In the International Journal on Very Large Data Bases (VLDB), 381-390.

Wang, R. Y., Strong, D. M. (1996) Beyond accuracy: What data quality means to data consumers. Journal of Management Information Systems, Vol. 12, Issue 4, 5-33.

