A Comparative Study
Bernardo Marques, Eliana Sousa, Tiago Silva-Costa, Ricardo Correia and Alberto Freitas
Department of Health Information and Decision Sciences, Faculty of Medicine, University of Porto, Porto, Portugal
CINTESIS - Center for Research in Health Technologies and Information Systems, University of Porto, Porto, Portugal
Keywords: Data quality problems, Administrative data, Hospital information systems.
Abstract: This paper is a preliminary study of the problems that arise when a departmental information
system database is integrated with a central database. The work allows a comparison between the
quality of the data collected for clinical purposes by a medical department and the data collected for
administrative and epidemiological purposes in a central hospital database. The different purposes of
these two data collections are expected to affect data consistency, namely completeness and level of
detail, among other data quality problems. We expect to identify the types of variables that are better
recorded in each information system by calculating and comparing the quality of similar variables. We
also expect to detect differences between the two systems in the registries of the same patients.
This paper can contribute to a better understanding of the quality of the integration of departmental
systems into the general hospital information system, pointing out limitations in consistency and
information flow. It is also our goal to suggest recommendations and strategies to prevent data quality
problems and to improve communication between central and departmental databases.
Over the past years we have been witnessing an
improvement of medical registries along with the
development of ever more capable digital systems
and data warehouse capacity. The exponential growth
of information has intensified interest in exploring
the information collected, not only for clinical
decisions and research studies but also for hospital
management. The value of this information depends
strongly on the quality of the data contained in the
registry (Arts et al., 2002). Therefore, studies of
data quality become even more relevant as the use of
these databases increases in magnitude and importance
(Freitas et al., 2010b). In Portugal in particular,
considerable effort has gone into studying the scale
of data quality issues in hospital databases and their
implications for decision makers, administrators and
researchers (Freitas et al., 2010a; Silva-Costa et al.,
2007; Silva-Costa et al., 2010).
Regarding central databases in the health care arena,
the Portuguese National Health Service (NHS) has had,
since 1990, a system called SONHO for the
management of hospital patients. This system supports
the registry of patients and of departments such as
pharmacy, blood or surgery, and is used in all NHS
public hospitals. Its introduction had a positive
impact on both productivity and the improvement of
diagnostic techniques (Dismuke and Sena, 1999). The
system has been collecting data systematically as
patients flow through the Portuguese public hospitals,
gathering huge amounts of data ready to be explored.
Apart from the fact that SONHO is not
accessible to all health professionals or
researchers, one problem of this system is that its
database model is so complex and the amount of
data so large that studies of this information are
still quite limited. For this reason, directors and
staff from hospital departments have been working
with development teams to implement different
information systems and integrate them with SONHO
(Cruz-Correia, 2010). The integration of these systems
has multiple advantages, namely easy access to the
collected information, a different database
structure and more specific information for
clinical purposes. The integration of these
information systems also carries some inherent risks
or disadvantages, in particular if the communication
with the central system (SONHO) is not as effective
as it should be. This can lead to several data
quality problems and/or to two different sets of data,
instead of two sets with the same information but
with different purposes (Cruz-Correia et al., 2006).
Marques B., Sousa E., Silva-Costa T., Correia R. and Freitas A.
DOI: 10.5220/0003759901950200
In Proceedings of the International Conference on Health Informatics (HEALTHINF-2012), pages 195-200
ISBN: 978-989-8425-88-1
2012 SCITEPRESS (Science and Technology Publications, Lda.)
This paper is a preliminary study of the problems
referred to above, which result from the integration
of a departmental system database with a central
administrative database. This work should allow a
comparison between the quality of the data collected
by medical departments for clinical purposes and
the data collected in central hospital databases for
administrative and epidemiological purposes. The
different purposes of these two data collections are
expected to have an impact on data consistency,
namely on completeness, level of detail and
other quality problems.
We expected to identify the types of variables that
are better recorded in each information system by
calculating and comparing the quality of similar
variables. We also expected to detect differences
between the two systems in the registries of the same
patients.
This paper can contribute to a better
understanding of the quality of the integration of
departmental systems into the general hospital
information system, pointing out limitations in
consistency and information flow. It is also
our goal to suggest recommendations and
strategies to prevent data quality problems and to
improve communication between central and
departmental databases.
This study was developed at Hospital São João
(HSJ), one of the largest central hospitals of the
Portuguese NHS. It is also a teaching hospital where
research teams develop and integrate numerous
information systems in different hospital
departments. Most of these information systems
are integrated with the central system SONHO;
this study therefore aims to evaluate, measure and
compare data quality between SONHO and the
information systems available at HSJ. The study
started by selecting the departmental information
systems to be considered (e.g., obstetrics, intensive
care, pneumology and haematology). Then, the
variables common to the departmental and the central
database will be studied to check consistency and
other quality issues.
As mentioned before, this paper presents
preliminary results of a comparative study; that is,
the results presented focus on one of the available
information systems.
The information system selected was ObsCare
(VirtualCare), an application running in the obstetrics
department to register and manage all obstetric
episodes. This system was introduced in the HSJ
obstetrics department in 2004 and has been collecting
data daily ever since, namely on parturients and
newborns.
In this study we used a simple method. We
started with an individual analysis of each field of
each table in each system. This first step aimed
to evaluate the individual data quality of both
systems. After this characterization, and once the
individual problems were understood, we merged the
equivalent tables from both systems so that the
field-by-field comparison could start. The merge
was based on the patient’s sequential number.
For the individual analysis we selected all
episodes registered in both systems. As SONHO has
been collecting data since 1997 and ObsCare only
started in 2004, the number of observations in each
table is greater in SONHO. Therefore, for the
comparison, we analyse only the registries common
to both systems, i.e., episodes from
January 2004 until 20 July 2011.
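The individual, field-by-field analysis described above can be illustrated with a short sketch. The study itself used Excel 2007 and SPSS 19; the Python code below is only an illustration, and the field names, the toy records and the rule that `None` and blank strings count as missing are assumptions made for the example.

```python
from typing import Dict, Iterable, List, Mapping, Tuple


def is_missing(value: object) -> bool:
    """Treat None and blank strings as missing values (an assumption)."""
    return value is None or (isinstance(value, str) and value.strip() == "")


def missing_summary(
    records: Iterable[Mapping[str, object]], fields: List[str]
) -> Dict[str, Tuple[int, float]]:
    """Return {field: (missing count, missing percentage)} for each field."""
    rows = list(records)
    total = len(rows)
    summary = {}
    for field in fields:
        n_missing = sum(1 for row in rows if is_missing(row.get(field)))
        pct = round(100.0 * n_missing / total, 2) if total else 0.0
        summary[field] = (n_missing, pct)
    return summary


# Hypothetical toy episodes table, not the study data.
episodes = [
    {"sequential_number": 1, "admission_date": "2004-01-05", "clinical_discharge_date": "2004-01-08"},
    {"sequential_number": 2, "admission_date": "", "clinical_discharge_date": "2004-02-01"},
    {"sequential_number": 3, "admission_date": "2005-03-10", "clinical_discharge_date": None},
    {"sequential_number": 4, "admission_date": "2006-07-21", "clinical_discharge_date": None},
]

summary = missing_summary(episodes, ["admission_date", "clinical_discharge_date"])
for field, (count, pct) in summary.items():
    print(f"{field}: {count} missing ({pct}%)")
```

Running the same summary on each table of each system gives the N and % columns of a per-table quality report.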
To run the study on these information systems,
authorization was granted by the director of the
obstetrics department. The research team involved in
this analysis consisted of informatics specialists,
developers of the ObsCare system and statisticians.
The tools used in this study were Excel 2007 and
SPSS 19.
In this section we present the results of our
preliminary study. We start with the results of the
individual analysis of data quality for each table.
Table 1 shows the results of the individual
analysis of the episodes tables of both systems.
Even with a larger set of data in its episodes
table, SONHO showed no data quality problems.
The same table in ObsCare, on the other hand,
shows some quality problems. The most prominent
one is the missing values in the variable
administrative discharge date. As shown in Table 1,
all registries have a missing value in this variable.
After checking with the developers of ObsCare, we
understood that this variable is never filled in this
system. The reason is simple: when a patient leaves
the obstetrics department, the clinical discharge date
is registered in ObsCare, whereas the administrative
discharge date is filled in SONHO by the
administrative staff when the patient leaves the
hospital. The two dates can differ because a
patient can be admitted to other departments after
leaving the obstetrics department and before leaving
the hospital. The problem is that ObsCare does not
receive any information from SONHO about the
patient after the patient leaves the department, so
the system does not know the patient’s administrative
discharge date. This is not a data quality problem but
an integration problem.
Table 1: Data quality problems observed in episodes tables.
Data quality problems                     ObsCare           SONHO
                                          N        %        N        %
Episodes table
Total registries                          30,985   -        51,410   -
Missing admission date                    65       0.21     -        -
Missing clinical discharge date           2,853    9.21     -        -
Missing administrative discharge date     30,985   100      -        -
Missing/invalid admission responsible     2,006    6.47     -        -
Missing/invalid discharge responsible     3,000    9.68     -        -
Nevertheless, some real quality problems were
detected in other variables. We detected 65 (0.21%)
registries with no admission date, although it should
not be possible to register a patient in the system
without filling this variable. The same was observed
for the clinical discharge date, with 2 853 (9.21%)
registries with missing values. Other problems
detected were missing or invalid values in the
variables admission responsible and discharge
responsible. These variables hold a numeric
code identifying the doctor responsible for the
admission or discharge; cases filled with 0 are
considered invalid. We thus detected 2 006 (6.47%)
and 3 000 (9.68%) missing/invalid values in admission
responsible and discharge responsible, respectively.
Once again, this reveals that ObsCare uses no
mechanism to validate or control this process.
Zero or blank values should not be accepted in
these variables.
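The validation rules argued for above (a mandatory admission date, and no zero or blank responsible codes) could be enforced at data entry. The sketch below is a hypothetical illustration, not ObsCare's actual validation logic: the field names are assumptions, and it assumes the check runs when an episode is closed, since the discharge responsible cannot be known earlier.

```python
def validate_episode(episode):
    """Return a list of validation errors for one closed episode (empty if valid)."""
    errors = []
    # The admission date is mandatory for every registered patient.
    if not episode.get("admission_date"):
        errors.append("admission_date is required")
    for field in ("admission_responsible", "discharge_responsible"):
        code = episode.get(field)
        # A blank or zero code does not identify any doctor.
        if code is None or str(code).strip() in ("", "0"):
            errors.append(f"{field} must be a non-zero doctor code")
    return errors


# Hypothetical records, not taken from the study data.
valid = {"admission_date": "2010-05-01", "admission_responsible": 1234, "discharge_responsible": 987}
invalid = {"admission_date": "", "admission_responsible": 1234, "discharge_responsible": 0}

print(validate_episode(valid))
print(validate_episode(invalid))
```

Rejecting a record while its error list is non-empty would prevent the missing and invalid values reported in Table 1 from entering the database.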
In Table 2 we present the data quality problems
observed in the identification tables.
The most relevant result that we can extract from
Table 2 is the missing values in the variable patient
number. This is an important variable for the
identification of the patient, and both systems show
a high percentage of missing values. Other variables,
such as contact or marital status, also show a high
number of missing values. Although these variables
are less important for the identification/notification
of patients, these are still data quality problems
that should not occur in these systems.
The missing values detected in the post code can
be easily explained: in some cases administrative
staff wrongly entered the post code as part of the
address variable. However, this problem should be
avoided to ensure better data quality for future
analysis or usage.
In ObsCare we also detected some cases of
missing values in birth date. There are few such
cases, but the validation mechanisms should not let
this happen.
In ObsCare, as we can see in Table 2, there are 2
cases of missing gender, 1 of missing patient name,
28 of missing process number and 4 of
missing address. SONHO also has 41 missing
values in the address variable. Their occurrence is
marginal, but they can serve as alerts about problems
to be addressed in future versions of the systems.
Table 2: Data quality problems observed in identification tables.
Data quality problems        ObsCare           SONHO
                             N        %        N        %
Identification table
Total registries             23,994   -        35,966   -
Missing patient number       3,390    14.1     5,111    14.2
Missing birth date           111      0.46     -        -
Missing post code            89       0.37     113      0.31
Missing contact (tel.)       10,919   45.5     5,772    16.1
Missing marital status       790      3.29     1,099    3.06
Missing gender               2        0.01     -        -
Missing address              4        0.02     41       0.11
Missing name                 1        0.00     -        -
Missing process number       28       0.12     -        -
Other problems, not presented in Table 2 but that
caught our attention, were inconsistencies in
several values. For example, since we are analysing
only obstetric episodes and the identification table
registers only the identification of the parturient
(female), all registries should have female as gender.
However, we detected 15 registries of males in
ObsCare and 14 in SONHO. Another inconsistency
detected in the patient’s gender is that this is a
numerical variable registered in the database as 1
(Male) or 2 (Female), yet in ObsCare we detected
275 (1.15%) registries with the value ‘F’. This
is truly an inconsistency, but in this case it is a
database problem.
In marital status the possible values are: single,
married, divorced, widow and cohabiting couples.
We detected 871 (3.6%) cases registered with
‘Unknown’ and 115 (0.5%) with ‘Other’, which are
not possible values in the form field for this variable.
In addition, 14 cases completely out of standard
were detected.
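Inconsistencies of this kind can be found by checking each value against the set of values its form field allows. The following is a minimal sketch under assumed field names and an assumed spelling of the allowed values; it is not the procedure used in the study.

```python
from collections import Counter

# Assumed allowed domains, based on the values listed in the text.
ALLOWED = {
    "gender": {1, 2},  # 1 = male, 2 = female
    "marital_status": {"single", "married", "divorced", "widow", "cohabiting"},
}


def out_of_domain(records, allowed=ALLOWED):
    """Count, per field, the non-missing values that fall outside the allowed set."""
    counts = {field: Counter() for field in allowed}
    for row in records:
        for field, domain in allowed.items():
            value = row.get(field)
            if value is None or value == "":
                continue  # missing values are reported separately
            if value not in domain:
                counts[field][value] += 1
    return counts


# Hypothetical identification records, not the study data.
identification = [
    {"gender": 2, "marital_status": "married"},
    {"gender": "F", "marital_status": "Unknown"},
    {"gender": 2, "marital_status": "Other"},
    {"gender": 1, "marital_status": ""},
]

problems = out_of_domain(identification)
print(dict(problems["gender"]))
print(dict(problems["marital_status"]))
```

A male code (1) passes this domain check, so the male parturients mentioned above would need an additional cross-field consistency rule.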
Table 3 summarizes the data quality results for
the newborn tables. Analysing the problems detected
in the Apgar variables, the lack of registries for
these variables in SONHO is evident: there are
9 228 (25.2%) missing values for apgar1 and
8 887 (24.27%) for apgar5. In ObsCare, only the
apgar10 score has a high percentage of missing
values, which suggests that the tenth-minute measure
is not considered as important as the other two.
Table 3: Data quality problems observed in newborn tables.
Data quality problems detected       ObsCare           SONHO
                                     N        %        N        %
Newborn table
Total registries                     21,225   -        36,611   -
Missing delivery type description    305      1.44     -        -
Missing son inpatient number         -        -        88       0.24
Missing son sequential number        -        -        81       0.22
Missing Apgar1                       58       0.27     9,228    25.2
Missing Apgar5                       65       0.31     8,887    24.3
Missing Apgar10                      11,885   56.0     -        -
Invalid delivery type                305      1.44     -        -
Invalid fetal presentation           305      1.44     -        -
In the newborn tables, as in the identification
tables, there are variables with invalid values.
The delivery type, for instance, is a string
variable with the possible values ‘Eutocic’, ‘Forceps’,
‘Vacuum’, ‘Cesarean’, ‘At home’, ‘In Pré-hospital
transportation’ and ‘Unknown’. In this variable we
detected 305 (1.44%) registries with representations
other than those listed. As a result, those
305 registries have missing values in the delivery
type description variable, because the database has
no correspondence for these delivery types.
The same happens with the variable fetal
presentation, where 305 registries with invalid values
were detected. In addition, we detected other
problems, such as 1 registry with 0 weight in
ObsCare and 2 in SONHO.
With the individual analysis of the tables in both
systems complete, the next phase is to compare
registries between the two systems. Missing values
are excluded from this comparison. Before presenting
the results of the comparison, it is important to point
out some differences between variable representations
in the two systems. For example, in SONHO the variable
fetal presentation has the possible values ‘T’
(Transverse), ‘C’ (Cephalic) and ‘P’ (Pelvic), while
in ObsCare, instead of ‘T’, there is an ‘E’ for
‘Espádua’. The two terms have the same meaning,
but from a database architecture point of view the
same values should be used in both systems.
The variable gender in the ObsCare newborn
table is of string type with values ‘F’ and ‘M’, while
in SONHO the same variable is numeric with values
2 and 1, respectively. A similar problem was detected
in the variable delivery type and its description:
in ObsCare the delivery type is a string
variable, while in SONHO it is numeric and the
possible values are different.
Even within the same system there are different
representations for the same variable. In ObsCare,
the variable gender is numeric in the identification
table while in the newborn table, as already
mentioned, it is a string variable.
For the comparison between identification tables,
cases were merged based on their sequential
number. During this process we detected
several registries in ObsCare with no
correspondence in SONHO and vice versa. In total,
2 101 such registries were detected in ObsCare
and 142 in SONHO. These cases were also excluded
from the comparison results presented in Table 4. As
we can observe in Table 4, only 21 893 of the 23 994
registries in the ObsCare identification table were
considered common to both tables.
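The merge step and the count of registries without correspondence can be sketched as a key-based join. The example below assumes that the sequential number is unique within each table; the function and field names are illustrative, not the study's actual procedure. In the reported figures, the 23 994 ObsCare registries minus the 2 101 without correspondence give the 21 893 common registries.

```python
def merge_by_key(table_a, table_b, key="sequential_number"):
    """Join two tables on a key assumed unique in each table.

    Returns (matched pairs, rows only in A, rows only in B).
    """
    index_b = {row[key]: row for row in table_b}
    matched, only_a = [], []
    seen = set()
    for row in table_a:
        other = index_b.get(row[key])
        if other is None:
            only_a.append(row)
        else:
            matched.append((row, other))
            seen.add(row[key])
    only_b = [row for row in table_b if row[key] not in seen]
    return matched, only_a, only_b


# Hypothetical toy tables, not the study data.
obscare_ids = [{"sequential_number": n} for n in (1, 2, 3, 4)]
sonho_ids = [{"sequential_number": n} for n in (2, 3, 5)]

matched, only_obscare, only_sonho = merge_by_key(obscare_ids, sonho_ids)
print(len(matched), len(only_obscare), len(only_sonho))
```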
In the common registries we detected 3 575
(16.33%) differences in the contact number
registered in both systems. Differences were
also detected in patient number and process number,
with 2 913 (13.31%) and 2 503 (11.43%) cases,
respectively.
The largest differences were detected in address
and marital status. The differences in marital
status can be partially explained by the different
possible values for this variable in the two systems.
For the address, it is not so simple to explain the
detected differences without a specific tool to
measure character string differences.
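As an illustration of what such a tool might look like, the sketch below uses `difflib.SequenceMatcher` from the Python standard library to separate formatting-only differences from genuinely different addresses. This was not part of the study; the normalization rule, the 0.85 threshold and the example addresses are assumptions.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lower-case and collapse whitespace so that formatting noise is ignored."""
    return " ".join(text.lower().split())


def similarity(a, b):
    """Similarity ratio in [0, 1] between two normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def classify_difference(a, b, near_threshold=0.85):
    """Label a pair as 'equal', 'near match' or 'different' (threshold is an assumption)."""
    score = similarity(a, b)
    if score == 1.0:
        return "equal"
    return "near match" if score >= near_threshold else "different"


# Hypothetical addresses, not taken from the study data.
print(classify_difference("Rua de Cima 12", "rua  de cima 12"))
print(classify_difference("Rua da Alegria 100", "Rua da Alegria 10"))
print(classify_difference("Rua da Alegria 100", "Avenida Central 5"))
```

Classifying the differing address pairs this way would show how many of them are spelling or formatting variants rather than different addresses.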
Differences were also observed in the patient name
variable. Using a lookup process to measure
the differences, it was possible to verify that most
cases differ by a single surname. In the majority of
these cases, the name in SONHO has one more
surname than in the ObsCare registry. We
also detected some misspellings or differences
in some letters of the names.
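The paper does not detail the lookup process. One possible way to flag names that differ by a single surname is a word-level comparison, sketched below with hypothetical function names and fictional patient names.

```python
def name_tokens(name):
    """Split a name into lower-case words."""
    return name.lower().split()


def differs_by_one_word(name_a, name_b):
    """True when one name equals the other plus exactly one extra word (e.g. a surname)."""
    a, b = name_tokens(name_a), name_tokens(name_b)
    if abs(len(a) - len(b)) != 1:
        return False
    longer, shorter = (a, b) if len(a) > len(b) else (b, a)
    # Removing some single word from the longer name must give the shorter one.
    return any(longer[:i] + longer[i + 1:] == shorter for i in range(len(longer)))


# Fictional names used only for illustration.
print(differs_by_one_word("Ana Sofia Martins Pereira", "Ana Sofia Pereira"))
print(differs_by_one_word("Ana Sofia Pereira", "Ana Sofia Pereira"))
print(differs_by_one_word("Ana Pereira", "Rita Martins Lopes"))
```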
Table 4: Identification’s tables comparison.
N %
Total common registries 21,893 -
Different contact (tel.) 3,575 16.3
Different patient number 2,913 13.3
Different process number 2,503 11.4
Different names 1,088 4.97
Different address 8,689 39.7
Different gender 1 0.00
Different birth date 189 0.86
Different post code 2,405 11.0
Different marital status 6,450 29.5
Table 5 summarizes the results of the comparison
between the newborn tables of both systems.
As in the comparison of identification
tables, we detected some cases where the merging
process could not join both tables. In total, 850
ObsCare registries have no correspondence in
SONHO and 210 newborn registries from SONHO
have no correspondence in ObsCare.
Table 5: Newborn’s tables comparison.
N %
Total common registries 20,375 -
Different delivery type 1,161 5.7
Different delivery type description 1,162 5.7
Different fetal presentation 563 2.8
Different birth date 1,594 7.8
Different weight 364 1.8
Different Apgar1 178 0.9
Different Apgar5 189 0.9
Different gender 152 0.7
Different live born (Y/N) 8 0.0
In Table 5 it is possible to observe that 1 594
(7.8%) cases have a different birth date registered
in the two systems. For delivery type and its
description, despite the differences in possible
values mentioned before, during the comparison
we forced similar delivery type values to be
considered the same. For example, we
forced the correspondence between ‘Eutocic’ in
ObsCare and SONHO’s values ‘Eutocic – Twins’,
‘Eutocic – Pelvic’ and ‘Eutocic’. All other values
were forced likewise when possible. The differences
detected for these two variables are therefore
effective differences, more precisely 1 161 (5.7%) for
delivery type and 1 162 (5.7%) for the respective
description.
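The forced correspondence described above amounts to mapping the values of one system onto the vocabulary of the other before counting differences. A minimal sketch follows; only the three ‘Eutocic’ entries of the mapping come from the text, while the field names, the toy pairs and the rest of the code are assumptions.

```python
# Mapping from SONHO descriptions to the ObsCare vocabulary. Only the
# 'Eutocic' entries come from the text; a real mapping would cover all values.
SONHO_TO_OBSCARE = {
    "Eutocic": "Eutocic",
    "Eutocic – Twins": "Eutocic",
    "Eutocic – Pelvic": "Eutocic",
}


def count_differences(pairs, field_a, field_b, mapping):
    """Count matched pairs whose values still differ after the forced mapping.

    Pairs with a missing value on either side are skipped.
    Returns (number of pairs compared, number of differences).
    """
    compared = different = 0
    for row_a, row_b in pairs:
        value_a, value_b = row_a.get(field_a), row_b.get(field_b)
        if value_a in (None, "") or value_b in (None, ""):
            continue
        compared += 1
        if value_a != mapping.get(value_b, value_b):
            different += 1
    return compared, different


# Hypothetical matched (ObsCare, SONHO) newborn pairs, not the study data.
pairs = [
    ({"delivery_type": "Eutocic"}, {"delivery_description": "Eutocic – Twins"}),
    ({"delivery_type": "Cesarean"}, {"delivery_description": "Eutocic"}),
    ({"delivery_type": "Forceps"}, {"delivery_description": ""}),
    ({"delivery_type": "Vacuum"}, {"delivery_description": "Vacuum"}),
]

compared, different = count_differences(
    pairs, "delivery_type", "delivery_description", SONHO_TO_OBSCARE
)
print(compared, different)
```

Only the differences that survive the mapping are counted, which corresponds to what the text calls effective differences.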
The same technique of forcing equalities was
used for the variable fetal presentation, but in this
case it was only necessary to force the ‘E’ in
ObsCare to be the same as ‘T’ in SONHO, for the
reasons already explained. Even so, we
detected 563 (2.8%) differences in fetal
presentation. Less frequent than the differences
already mentioned, but no less important, are the
differences in weight, with 364 (1.8%) cases,
apgar1 with 178 (0.9%) and apgar5 with 189 (0.9%)
cases. These are numeric values measured only once,
so it is hard to understand why they differ between
the two systems. We also detected 152 (0.7%)
differences in the registries of the newborn’s gender,
and in 8 cases the registries do not match in the
variable live born.
Next, we present the last comparison table with
the results of the comparison between the episode
tables from both systems.
Table 6: Episode’s tables comparison.
N %
Total common registries 24,971 -
Different discharge date 1,509 6.0
Different admission date 1,182 4.7
Different admission responsible 4,198 16.8
Different discharge responsible 2,688 10.8
As we can observe in Table 6, there are many
differences between the date variables of the two
systems. The discharge date variable shows 1 509
(6.0%) differing registries, and in admission
date we detected 1 182 (4.7%) differences.
We also found a high percentage of differences in the
admission and discharge responsible variables (16.8%
and 10.8%, respectively). These results clearly show
some issues in the communication between the
systems involved.
The results presented in this paper make it clear
that some issues need improvement so that
the integration process can be as reliable and
consistent as possible. In the end, we think that these
two systems work independently and that there is in
fact no real integration between them. All
registries are duplicated, i.e., each registry is
entered manually in both applications by
different health professionals. That is a big concern
in terms of data quality, as this process can lead to
divergent registries and even to duplication of errors.
This would be avoided if the communication
between the systems were more effective, reducing
the sources of error.
By analysing the results on the individual quality
of the data produced by both systems, it is possible to
understand that ObsCare needs additional validation
tools. In fact, there are tools implemented in this
system but, as the presented results show,
they are not as effective as desired. It is
nevertheless clear that ObsCare, because of its
purpose, has more detailed data, although not in a
consistent and complete way. There is a considerable
amount of missing data, some variables have invalid
values registered and, as we verified, there are
different representations for the same variables.
The central system SONHO shows less
interest in collecting some specific variables, as they
are not as important for the purpose of the system.
Nevertheless, some of the data problems detected can
be very useful to call the attention of the NHS, so
that it can change the way data are collected,
improving their completeness, consistency and detail.
The comparison makes the differences between the
two systems clear. Differences were
detected in every variable and table analysed. This
indicates that the integration failed, as there is no
real interaction. Better communication between the
two systems could lead to more reliable information
and save time in data entry, so that
health professionals can have more time to
focus on patients and on research.
This is a preliminary study, so all results
collected and presented will be further explored
in our future work. In the next steps of our
research we will work with developers to test
and improve their validation tools and to implement
an application that scans all data and checks for
these and other data quality problems. We would also
like to extend this study to other departmental
systems in use at the HSJ central hospital.
The authors would like to acknowledge the support given by
the research project HR-QoD – Quality of data
(outliers, inconsistencies and errors) in hospital
inpatient databases: methods and implications for
data modelling, cleansing and analysis (project
PTDC/SAU-ESA/75660/2006).
Arts, D. G., De Keizer, N. F. and Scheffer, G. J., 2002.
Defining and improving data quality in medical
registries: a literature review, case study, and generic
framework. J Am Med Inform Assoc, 9, 600-11.
Cruz-Correia, R., Vieira-Marques, P., Ferreira, A.,
Oliveira-Palhares, E., Costa, P. and Costa-Pereira, A.,
2006. Monitoring the integration of hospital
information systems: How it may ensure and improve
the quality of data. Stud Health Technol Inform, 121,
Cruz-Correia, R. J., 2010. Implementation, monitoring and
utilization of an integrated Hospital Information
System--lessons from a case study. Stud Health
Technol Inform, 160, 238-41.
Dismuke, C. E. and Sena, V., 1999. Has DRG payment
influenced the technical efficiency and productivity of
diagnostic technologies in Portuguese public
hospitals? An empirical analysis using parametric and
non-parametric methods. Health Care Manag Sci, 2,
Freitas, A., Marques, B., Silva-Costa, T., Lopes, F.,
Garcia-Lema, I. and Costa-Pereira, A., 2010a. Data
quality issues in DRG databases. In: 26th PCS
International Conference, Munich.
Freitas, A., Silva-Costa, T., Marques, B. and Costa-
Pereira, A., 2010b. Implications of data quality problems
within hospital administrative databases. In: 12th
Mediterranean Conference on Medical and Biological
Engineering and Computing – MEDICON 2010, 27-30
May, Porto Carras, Chalkidiki, Greece.
Silva-Costa, T., Freitas, A., Jácome, J., Lopes, F. and
Costa-Pereira, A., 2007. A eficácia de uma ferramenta
de validação na melhoria da qualidade de dados
hospitalares. In: CISTI - 2ª Conferência Ibérica de
Sistemas e Tecnologias de Informação, 21 a 23 de
Junho, Porto.
Silva-Costa, T., Marques, B. and Freitas, A., 2010.
Problemas de Qualidade de Dados em Bases de Dados
de Internamentos Hospitalares. In: 5ª Conferência
Ibérica de Sistemas e Tecnologias de Informação, 16 a
19 de Junho, Santiago de Compostela.
VirtualCare. VCOBS.GYN - ObsCare [Online]. Available:
bsgyn-eng.html [Accessed].