ANALYSIS-SENSITIVE CONVERSION OF ADMINISTRATIVE DATA INTO STATISTICAL INFORMATION SYSTEMS

Mirko Cesarini, Mariagrazia Fugini

Politecnico di Milano, Dipartimento di Elettronica e Informazione

Via Ponzio, 34/5 I-20133 MILANO, Italy

Mario Mezzanzanica

Università degli Studi di Milano-Bicocca, Dipartimento di Statistica

Via Bicocca degli Arcimboldi 8, I-20126 MILANO, Italy

Keywords:

Statistical Information Systems, Taxation Archives, Decision Support Systems, Data Quality, Integrating Heterogeneous Data Sources, Data Warehouse.

Abstract:

In this paper we present a methodological approach for developing a Statistical Information System (SIS) out of the administrative archives of Public Administrations. Such archives are a rich source of information, but any attempt to use them as sources for statistical analysis reveals errors and incompatibilities that prevent their use as a basis for statistics and decision support. The proposed approach covers the steps needed to build a SIS out of administrative data: design of an integration model for different and heterogeneous data sources, improvement of the overall data quality, removal of errors that might affect the correctness of statistical analysis, design of a data warehouse for statistical analysis, and design of a multidimensional database to develop indicators for decision support. We present a case study, the AMeRIcA Project.

1 INTRODUCTION

Public Administrations (PA) are facing institutional and organizational changes that require managers, stakeholders, and politicians to speed up their decision-making processes. A key role is played by the development of Statistical Information Systems (SIS) aimed at supporting decision, analysis, monitoring, and control activities. In particular, data derived from administrative sources (e.g., government registries, tax registries) are of basic value for gathering information about the community and for feeding the SIS. However, administrative data are often incorrect and unsuitable for statistics and decision making. Hence, they need to be cleaned of errors and pre-processed before being loaded into statistical databases. This paper illustrates the AMeRIcA project (Anagrafe Milanese e Redditi Individuali con Archivi - Milan Registry Office and Individual Income with Archives), where the administrative archives available from the Registry Office of the Milan Municipality and from the Italian Income Office are used to derive statistical information about the actual income of individuals and families in Milan. Some experiences show that the integrated use of tax-related databases together with Registry databases makes it possible to obtain rich information (Statistics Denmark, 2000). Along this line, AMeRIcA applies statistical analysis to data gathered from PA administrative sources (representative of the whole population) rather than to sample surveys. An innovative aspect of AMeRIcA, from both the statistical and the ICT viewpoints, is the use of a Data Warehouse designed to integrate different administrative sources. This makes it possible to apply statistical analysis models encompassing different facts about the whole population, thereby deriving significant and accurate results for the observed universe.

2 BUILDING A STATISTICAL INFORMATION SYSTEM

Within an organization, a SIS is loaded and continuously fed with data derived from the administrative and management systems. A SIS has two main purposes (UNECE, 2000): to support decision-making processes through the construction of directional indicators, which are the final result of data collection, analysis, and processing activities; and to return to the management systems information useful for their update, evolution, and quality management over time.

Cesarini M., Fugini M. and Mezzanzanica M. (2006).
ANALYSIS-SENSITIVE CONVERSION OF ADMINISTRATIVE DATA INTO STATISTICAL INFORMATION SYSTEMS.
In Proceedings of the Eighth International Conference on Enterprise Information Systems - DISI, pages 293-296
DOI: 10.5220/0002494402930296
SciTePress

The first operation to be performed to build a SIS is a detailed study of the source archives. The quality of the data sources should be checked, and data cleaning operations should be performed to remove errors that might negatively affect the statistical analysis. Then the archives are checked for cross inconsistencies, and finally the data are integrated into a global archive.
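
As a rough illustration of the checks named above, the following Python sketch looks for duplicate keys inside one archive and for attribute values that disagree between two archives. The field names (person_id, birth_year, city) and the sample records are invented for the example and are not taken from the project archives.

```python
from collections import Counter


def find_duplicates(records, key):
    """Return the key values that occur more than once in one archive."""
    counts = Counter(r[key] for r in records)
    return sorted(k for k, n in counts.items() if n > 1)


def cross_inconsistencies(archive_a, archive_b, key, fields):
    """List (key, field, value_a, value_b) where the two archives disagree."""
    index_b = {r[key]: r for r in archive_b}
    issues = []
    for rec in archive_a:
        other = index_b.get(rec[key])
        if other is None:
            continue  # unmatched records are a coverage issue, handled elsewhere
        for field in fields:
            if rec.get(field) != other.get(field):
                issues.append((rec[key], field, rec.get(field), other.get(field)))
    return issues


# Hypothetical sample data: one duplicate in the registry, one disagreement.
registry = [
    {"person_id": "A1", "birth_year": 1970, "city": "Milano"},
    {"person_id": "B2", "birth_year": 1985, "city": "Milano"},
    {"person_id": "B2", "birth_year": 1985, "city": "Milano"},
]
income = [
    {"person_id": "A1", "birth_year": 1971, "city": "Milano"},
    {"person_id": "B2", "birth_year": 1985, "city": "Milano"},
]

duplicates = find_duplicates(registry, "person_id")
issues = cross_inconsistencies(registry, income, "person_id", ["birth_year", "city"])
print(duplicates)
print(issues)
```

The sketch only reports problems; deciding which archive holds the correct value remains a separate cleaning decision.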

2.1 Data Integration and Cleaning

The first steps required to build a SIS are a detailed analysis of the archives and the development of a global integration schema, which will drive the subsequent steps. The next activity is the establishment of a mapping schema between the global integrated schema and the schemas of the single archives (local schemas). Finally, the steps of the data migration process towards the integrated archive should be detailed. During data migration, data quality issues might occur and should be resolved, as we show in Sec. 2.2. Moreover, data loaded into the global integration schema instance might prove unsuitable for the analysis, leading to misinterpretations. For this reason the SIS development process should be iterative, with the aim of progressively tuning the global integration schema and the migration procedure. In addition, schemas may not completely capture the semantics of the data they describe, and there may be several plausible mappings between two schemas. This subjectivity makes it valuable to have user input to guide the match and essential to have user validation of the result.
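
A mapping schema between the global schema and the local schemas can be pictured as a small lookup table. The Python sketch below is only an assumption about how such a mapping might be encoded: the local attribute names (cod_fiscale, anno_nascita, zona, cf, reddito_tot), the global attribute names, and the sample identifier are all hypothetical. Problems found during migration are collected rather than silently dropped, so that a user can validate the result.

```python
# Hypothetical mappings: global attribute -> (local attribute, converter).
REGISTRY_MAPPING = {
    "person_id": ("cod_fiscale", str.upper),
    "birth_year": ("anno_nascita", int),
    "district": ("zona", str.strip),
}
INCOME_MAPPING = {
    "person_id": ("cf", str.upper),
    "total_income": ("reddito_tot", float),
}


def to_global(local_record, mapping):
    """Translate one local record into the global schema.

    Returns the translated record plus a list of problems to be reviewed.
    """
    out, problems = {}, []
    for global_attr, (local_attr, convert) in mapping.items():
        raw = local_record.get(local_attr)
        if raw is None:
            problems.append(f"missing {local_attr}")
            continue
        try:
            out[global_attr] = convert(raw)
        except (TypeError, ValueError):
            problems.append(f"bad value for {local_attr}: {raw!r}")
    return out, problems


# Made-up records for illustration only.
row, row_problems = to_global(
    {"cod_fiscale": "rssmra70a01f205x", "anno_nascita": "1970", "zona": " 3 "},
    REGISTRY_MAPPING,
)
bad_row, bad_problems = to_global(
    {"cf": "rssmra70a01f205x", "reddito_tot": "n/a"},
    INCOME_MAPPING,
)
print(row, row_problems)
print(bad_row, bad_problems)
```

Keeping the mapping as data rather than code makes it easier to revise during the iterative tuning of the global schema described above.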

2.2 Data Quality Improvement

The main problem in using administrative databases for statistical and decision-making purposes is the presence of errors that do not affect the regular use of the archive for administrative purposes. Such errors are hardly noticed and, even when discovered, are usually tolerated. However, these errors and the low quality of data can negatively affect statistical analysis. Therefore, data sources need to undergo quality-improvement pre-processing before being used as input for any kind of analysis. Administrative databases are employed to access information describing a single item at a time (e.g., the address of a person), while statistical analysis deals with collections of items (e.g., how many people live within an area). This different usage of archives may unveil simple errors, like duplicate records, or more complex ones, e.g., inhabitants who are registered in the Registry Office of a neighbouring town and not in the town where they live. Some of the problems may be fixed by performing data cleaning actions whose results have only a certain degree of reliability, and therefore require manual evaluation employing data quality metrics such as accuracy, consistency, completeness, and timeliness (integration quality criteria). Many cleaning techniques can be used. We do not investigate this topic further here; we only highlight that these techniques have different costs in terms of the execution time required (of both humans and computers), and that "optimal mix selection" issues arise when resources are scarce. The optimal mix selection is performed by evaluating an execution cost and a quality improvement rate for each candidate operation. The estimation of both values is a heuristic operation, based on experience as well.
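
The text does not state how the optimal mix is computed. One simple possibility, shown here purely as an assumption, is a greedy heuristic that ranks candidate operations by estimated improvement per unit of cost and picks them until a budget is exhausted. The operation names, costs, and improvement rates below are invented; a greedy choice is not guaranteed to be optimal in general.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CleaningOp:
    name: str
    cost: float         # estimated execution cost, e.g. person-hours (must be > 0)
    improvement: float  # estimated quality improvement rate


def select_mix(candidates, budget):
    """Greedy heuristic: take operations by improvement/cost while they fit."""
    ranked = sorted(candidates, key=lambda op: op.improvement / op.cost, reverse=True)
    chosen, spent = [], 0.0
    for op in ranked:
        if spent + op.cost <= budget:
            chosen.append(op)
            spent += op.cost
    return chosen, spent


# Hypothetical candidate operations with heuristic estimates.
candidates = [
    CleaningOp("remove exact duplicates", cost=2.0, improvement=0.10),
    CleaningOp("normalize addresses", cost=8.0, improvement=0.20),
    CleaningOp("manual review of residence", cost=20.0, improvement=0.25),
    CleaningOp("validate identifier format", cost=1.0, improvement=0.04),
]

chosen, spent = select_mix(candidates, budget=12.0)
for op in chosen:
    print(op.name)
```

With these made-up estimates, the expensive manual review is left out because it does not fit the remaining budget.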

3 THE AMeRIcA PROJECT

The concepts illustrated above are now presented for the AMeRIcA Project. The approach comprises various independent phases: from data integration and quality analysis to the definition of statistical indicators, via the analysis of information sources, database design, the data transformation and management process, and the definition of a multidimensional model for data analysis as decision support. The reference population is provided by the Registry of the Milan Municipality. Data on this population are fundamental, since it is impossible to obtain from the Income Office a data provisioning bounded to a geographic area. A cross reference between the Registry Archives and the Income Archives allows one to obtain the desired information. The process of data interpretation, cleaning, and normalization, applied both to single sources and to integrated data, has required great effort and deep knowledge of the data domain.
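
The cross reference between the two archives can be sketched as a keyed join in which the Registry defines the reference population. The Python example below is a minimal sketch under stated assumptions: the shared key is the Fiscal Code, values from the Registry override those from the Income Archive on shared attributes, and all field names and records are invented.

```python
def link_income_to_registry(registry, income, key="fiscal_code"):
    """Join income records to the reference population given by the registry.

    Returns (linked, outside, without_return):
      linked         - income records of registered residents, enriched with
                       registry data (registry values win on shared attributes)
      outside        - income records whose key is not in the registry
      without_return - registry records with no matching income record
    """
    residents = {r[key]: r for r in registry}
    linked, outside = [], []
    for rec in income:
        person = residents.get(rec[key])
        if person is None:
            outside.append(rec)
        else:
            linked.append({**rec, **person})
    matched = {m[key] for m in linked}
    without_return = [r for r in registry if r[key] not in matched]
    return linked, outside, without_return


# Hypothetical sample data.
registry = [
    {"fiscal_code": "AAA", "address": "Via Roma 1"},
    {"fiscal_code": "BBB", "address": "Via Po 2"},
]
income = [
    {"fiscal_code": "AAA", "address": "Via Vecchia 9", "total_income": 30000.0},
    {"fiscal_code": "CCC", "address": "Via Torino 5", "total_income": 18000.0},
]

linked, outside, without_return = link_income_to_registry(registry, income)
print(linked)
```

Returning the unmatched records on both sides keeps coverage problems visible instead of hiding them in the join.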

The Income Archive also holds some registry information about people; however, preference has been given to data derived from the Registry Archive, since it is usually more up to date. In fact, an individual notifies address changes to the Registry Office quickly, while the Income Office is notified once per year with the tax declaration form. Records describing the same person in different archives are identified by the Fiscal Code (FC, similar to the US Social Security Number). Once different records on the same individual have been identified, further information (e.g., profession, qualification, education, and so on) that is significant for analysis and not present in the Income Archive may be used. However, the poor freshness of some archives would violate the information quality criteria; thus, such additional information has not been included in the analysis.

The portion of data in the AMeRIcA SIS coming from the Income Office refers to the income returns of both companies and people. Individuals declare income data by filling in different forms, according to the types of income and property they hold. Three common basic macro-information types can be identified: the total incomes grouped by income source; the deductions and detractions; and the taxation of natural persons needed to determine the tax drag. Around this information core, an integration model has been constructed that is able to drive the migration process and to highlight information relevant for statistical analysis.

ICEIS 2006 - DATABASES AND INFORMATION SYSTEMS INTEGRATION

Once the integration model has been selected, the delivered archive undergoes a pre-processing aimed at improving the quality and reliability of the information and at aligning the classifications with the adopted standards. Two types of pre-processing procedures are used: semantic and syntactic cleaning. Two different integration levels can then be identified: integration at the single archive level, concerning deliveries over different years, and integration at the global level, where different archives are involved.

1) Integration at the single archive level: deliveries of the same archive over different years can comprise heterogeneous information and hence must be reconciled to a unique data model that takes into account the information common to the different deliveries. The selection of the common information is driven by the analysis to be performed later, privileging relevant information or data present over different years, and hence comparable. A meaningful example is the delivery of an archive from the Income Office: in the considered years, the tax laws underwent many changes, which caused the record layout of income tax information to change every year.

2) Integration in the system: this includes linking different information coming from distinct sources. The goal is to enrich the information content of the subjects to be analyzed (and consequently the range of possible queries) by collecting the different pieces of information about the same subject that are scattered among different sources.

The process described in the previous steps can be summarized in terms of the flow reported in Fig. 1.
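
Integration at the single archive level (item 1 above) can be illustrated by computing the attributes shared by all yearly deliveries and projecting every record onto them. The Python sketch below is an assumption about one possible realization; the attribute names and yearly records are invented, and the optional relevant argument stands in for the analysis-driven selection described in the text.

```python
def common_attributes(deliveries):
    """Attributes present in every yearly delivery (dict: year -> records)."""
    attr_sets = [
        set().union(*(rec.keys() for rec in records))
        for records in deliveries.values()
    ]
    return set.intersection(*attr_sets)


def reconcile(deliveries, relevant=None):
    """Project all deliveries onto one data model of shared attributes."""
    shared = common_attributes(deliveries)
    if relevant is not None:
        shared &= set(relevant)  # keep only attributes needed by the analysis
    unified = []
    for year, records in sorted(deliveries.items()):
        for rec in records:
            row = {attr: rec.get(attr) for attr in sorted(shared)}
            row["year"] = year
            unified.append(row)
    return sorted(shared), unified


# Hypothetical deliveries whose record layout changed between years.
deliveries = {
    2001: [{"fiscal_code": "AAA", "total_income": 30000.0,
            "deductions": 1200.0, "new_tax_credit": 300.0}],
    2000: [{"fiscal_code": "AAA", "total_income": 28000.0,
            "deductions": 1000.0, "family_allowance": 150.0}],
}

shared, unified = reconcile(deliveries)
print(shared)
print(unified)
```

Attributes that appear in only one year are dropped from the unified model, which keeps the yearly figures comparable at the price of losing year-specific detail.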

Figure 3: The AMeRIcA Data Warehouse.
[Figure labels: Initial Data Provisioning; Private Data Masquerading; Operative Data Storage (Tax Archive and Registry Archive for Year X and Year X+1); Data Warehouse Level (Data Warehouse Year X, Data Warehouse Year X+1); Data Mart Level (Economic Data Mart, Demographic Data Mart).]

[...] allows one to outline the guidelines for defining decision-making policies. In recent years, the reuse of statistical data (Hoffmann, 1995), (Thomson and Holmy, 1998) has increased the demand for easy access to a variety of pre-existing data sources (Sundgren, 1996). Some works address the integration of existing data sources of national or regional statistical offices, or of providers of comparable nature (Denk and Froeschl, 2000), (Hatzopoulos et al., 1998). Other works leverage metadata classification to drive data integration and processing (Papageorgiou et al., 2001); another category of works refers to the quality of data (IQ1, 2005) and to specific quality assurance for census data (Census, 2005). An attempt to feed a SIS using the archives of PAs or large enterprises is reported in (Buzzigoli, 2002), aimed at efficient information system integration in a PA structure (e.g., the census of archives within an administration). However, a discussion of data quality, consistency, and archive integration issues is still missing. The link between an administrative and management system and the SIS is bidirectional: the administrative and management system feeds the SIS, while the SIS provides indications to the administrative and management system to support improvements over time. Such a link is strong in principle, although poorly implemented in practice. Administrative systems are designed using a self-referential logic that privileges the definition of services functional to the organizational model rather than to the stakeholders or to the statisticians. This results in expensive activities to normalize the data and to ensure the required data quality and standardization. An enabling factor for SIS construction is the ability of a PA to ensure the transversal and reciprocal acknowledgement of concepts, even if they are used in different administrative processes, and to relate such concepts to standard codifications. Another factor is the quality of the documentation provided by the sources, which is often scarce or absent, making the conceptual design of the SIS harder. A current development of AMeRIcA regards the use of social security data. Using social security data owned by employment centres, it will be possible to correctly identify the available wealth of a larger set of citizens.

REFERENCES

Buzzigoli, L. (2002). The new role of statistics in local public administration. In Proceedings of the Conference Quantitative Methods in Economics (Multiple Criteria Decision Making XI), pages 28–34, Faculty of Economics and Management, Slovak Agricultural University, Nitra (SK).

Census (2005). Census Bureau section 515 information quality guidelines, Office of Management and Budget, guidelines for ensuring and maximizing the quality, objectivity, utility, and integrity of information disseminated by federal agencies. Available at http://www.census.gov/quality/.

Denk, M. and Froeschl, K. (2000). The IDARESA data mediation architecture for statistical aggregates. Research in Official Statistics, 3(1):7–38.

Hatzopoulos, M., Karali, I., and Viglas, E. (1998). Attacking diversity in NSIs' storage infrastructure: The ADDSIA approach. In Proceedings of the International Seminar on New Techniques and Technologies in Statistics, pages 229–234, Sorrento (IT).

Hoffmann, E. (1995). We must use administrative data for official statistics - but how should we use them? Statistical Journal of the United Nations/ECE, 12:41–48.

IQ1 (2005). Information quality I, 2005. Principles and foundation, the MIT total data quality management program. Available at http://web.mit.edu/tdqm/www/index.shtml.

Papageorgiou, H., Pentaris, F., Theodorou, E., Vardaki, M., and Petrakos, M. (2001). A statistical metadata model for simultaneous manipulation of both data and metadata. J. Intell. Inf. Syst., 17(2-3):169–192.

Statistics Denmark (2000). The use of administrative sources for statistics and international comparability (invited paper). In Conference of European Statisticians, 48th plenary session, Paris (FR). Statistical Commission and Economic Commission for Europe.

Sundgren, B. (1996). Making statistical data more available. International Statistical Review, 64(1):23–38.

Thomson, I. and Holmy, A. (1998). Combining data from surveys and administrative record systems - the Norwegian experience. International Statistical Review, 66(2):201–221.

UNECE (2000). Statistical metadata. In Conference on European Statisticians Statistical Standards and Studies - No. 53, Geneva (CH).
