[Figure: the input p feeds a recurrent tanh layer, whose output a1(k) feeds a linear output layer a2(k). The layer equations are a1(k) = tanh(IW1,1 p + LW1,1 a1(k − 1) + b1) and a2(k) = LW2,1 a1(k) + b2 = y, where a1(k − 1) is the fed-back hidden state.]
Figure 3: Elman network generic structure.
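As a sketch, the two layer equations of the Elman network in Figure 3 can be written out directly in NumPy; the dimensions and random weights below are illustrative assumptions, not values from the paper.

```python
import numpy as np

def elman_step(p, a1_prev, IW11, LW11, b1, LW21, b2):
    """One time step of the Elman network of Figure 3:
    a1(k) = tanh(IW11 @ p + LW11 @ a1(k-1) + b1)  (recurrent tanh layer)
    a2(k) = LW21 @ a1(k) + b2 = y                 (linear output layer)
    """
    a1 = np.tanh(IW11 @ p + LW11 @ a1_prev + b1)
    a2 = LW21 @ a1 + b2
    return a1, a2

# Illustrative sizes: 1 input, 3 hidden units, 1 output.
rng = np.random.default_rng(0)
IW11 = rng.standard_normal((3, 1))   # input weights
LW11 = rng.standard_normal((3, 3))   # recurrent (layer) weights
b1 = rng.standard_normal(3)
LW21 = rng.standard_normal((1, 3))   # output weights
b2 = rng.standard_normal(1)

a1 = np.zeros(3)                     # a1(0): initial recurrent state
for x in [0.5, -0.2, 0.8]:           # a small input sequence
    a1, y = elman_step(np.array([x]), a1, IW11, LW11, b1, LW21, b2)
```

The hidden state a1 is carried from one call to the next, which is what distinguishes the Elman network from a purely feed-forward one.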
proper correction), if desired, any new observation of
the monitored time series.
Data Monitoring comprises two main phases. In the first (off-line) phase, a sufficient number of past observations of the series are used to adjust a pre-processing algorithm, which will be described in the next subsection. These pre-processed observations are then used to train a neural forecasting model, using both feed-forward and recurrent network architectures as described in the previous section. The trained neural model is considered adjusted when the forecasting error for the test set (composed of the last observations of the available data) lies within a previously specified value for the individual time series under analysis. In this phase, the error figure used is the MAPE (Mean Absolute Percentage Error).
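The MAPE used as the error figure can be sketched as follows; the function name and the toy values are illustrative, not from the paper.

```python
import numpy as np

def mape(actual, forecast):
    """Mean Absolute Percentage Error, in percent:
    100 * mean(|(actual - forecast) / actual|)."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return 100.0 * np.mean(np.abs((actual - forecast) / actual))

# Errors of 10%, 5% and 0% average to a MAPE of about 5%.
error = mape([100.0, 200.0, 400.0], [110.0, 190.0, 400.0])
```

Note that the MAPE is undefined when an actual observation is zero, which is one reason it suits strictly positive monitored series.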
The second phase works on-line. Each monitored series already has an adjusted model ready to receive a new observation and to test its validity. This validation test is detailed in Subsection 4.2.
4.1 Semi-automatic Pre-processing
Algorithm
The pre-processing phase is of great importance, and much of the monitoring success depends on it.
First of all, one must detect whether the time series trend is stochastic or deterministic. We perform this verification by applying a combination of the Augmented Dickey-Fuller (ADF) (Dickey and Fuller, 1979) and Phillips-Perron (Phillips, 1987) tests. The resulting test detects whether or not there are unit roots in the generating process of the time series. If they exist, the trend is stochastic and one must take the first difference of the series n times in order to make it stationary, where n is the number of unit roots detected (that is, the order of integration of the process).
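Assuming the unit-root count n has already been obtained (in practice an ADF test is available, e.g., as statsmodels.tsa.stattools.adfuller), the repeated first-differencing step can be sketched in NumPy; the function name and sample values are illustrative assumptions.

```python
import numpy as np

def difference(series, n):
    """Take the first difference of the series n times, where n is the
    number of unit roots detected (the order of integration)."""
    x = np.asarray(series, dtype=float)
    for _ in range(n):
        x = x[1:] - x[:-1]   # first difference: x(t) - x(t-1)
    return x

# A series with a stochastic (random-walk) trend becomes stationary
# after one difference, which recovers the underlying increments:
walk = np.cumsum([1.0, 2.0, -0.5, 0.7, 1.2])
d1 = difference(walk, 1)
```

The same result is given by `np.diff(walk, n=1)`; each differencing pass shortens the series by one observation.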
If the test detects no unit roots, one may conclude that the trend is deterministic. In this case, we remove the trend by polynomial fitting. In both the stochastic and the deterministic case, we also test for the presence of seasonal cycles (after trend removal), removing them by spectral (Fourier) analysis or by a convenient periodic differencing¹ (Chatfield, 1984).
[Figure: flow diagram — select a series; perform the unit root test; if the number "n" of unit roots is greater than zero, set the "Stochastic Trend" flag and take the first difference of the series "n" times; otherwise, set the "Deterministic Trend" flag and remove the series trend by polynomial fitting; then verify the presence of any cycles and remove them by spectral (FFT) analysis or periodic differencing; the series is now pre-processed and ready to be fed into the neural system.]
Figure 4: Semi-automatic pre-processing algorithm to remove trends and possible cycles.
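The deterministic-trend branch can be sketched with NumPy's polynomial fitting; the function name and the polynomial degree are illustrative choices, not prescribed by the paper.

```python
import numpy as np

def remove_polynomial_trend(series, degree=1):
    """Fit a polynomial of the given degree to the series (deterministic
    trend) and return the detrended residuals and the fitted trend."""
    y = np.asarray(series, dtype=float)
    t = np.arange(len(y))
    coeffs = np.polyfit(t, y, degree)   # least-squares polynomial fit
    trend = np.polyval(coeffs, t)
    return y - trend, trend

# A purely linear trend is removed almost exactly by a degree-1 fit:
residual, trend = remove_polynomial_trend([3.0, 5.0, 7.0, 9.0, 11.0],
                                          degree=1)
```

The residual series is what would then be tested for seasonal cycles and fed to the neural model.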
This pre-processing phase is called semi-automatic because it cannot exclude the intervention of a human specialist, who inserts his or her knowledge into the system, mainly for cycle removal. Figure 4 shows these steps in a flow diagram.
4.2 Validation Test
The on-line phase of the monitoring system consists of receiving new observations and testing their validity according to the previously developed model. When a new observation x(T) arrives, we already have an estimated (predicted) value x̂(T) for it, which was determined by the adjusted neural model.
Here, human intervention may also be necessary. For example, new observations can be excluded by direct action of a manager who feels that they are biased by some temporary condition that he or she is aware of (and that the system may not be able to track in time). On the other hand, the system is designed to automatically screen all current observations to identify those that appear unusual, that is, outliers. Each outlier could be called to the attention of an appropriate management person, who would then decide whether or not to include the observation in the forecasting process (Montgomery et al., 1990). In fact, this outlier may have a reasonable origin, or may simply be the result of error.
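The automatic screening described above can be sketched as a simple threshold on the magnitude of the forecast error; the function name and the threshold value are illustrative assumptions, and in practice the threshold would be calibrated per series.

```python
def screen_observation(x_new, x_pred, threshold):
    """Flag the new observation x(T) as a potential outlier when the
    forecast error e(T) = x(T) - x_pred(T) is too large in magnitude.
    A flagged value is referred to a manager, not rejected outright."""
    error = x_new - x_pred
    is_outlier = abs(error) > threshold
    return error, is_outlier

# A new observation far from its prediction is flagged for review:
e, flag = screen_observation(x_new=130.0, x_pred=100.0, threshold=20.0)
```

Here e = 30.0 exceeds the threshold, so the observation would be called to the attention of an appropriate management person.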
Outliers can be identified by analyzing the forecast error e(T) = x(T) − x̂(T). If this error is large, it may be concluded that the observation x(T) came
¹ If a series is known to have a weekly seasonality (as in the case of electricity consumption), it is more convenient to remove this cycle by applying the (1 − B^7) operator, where B is the delay operator, that is, Bx(t) = x(t − 1).
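The (1 − B^7) operator of the footnote amounts to subtracting the observation from one week earlier; this can be sketched in NumPy, with the function name and sample data as illustrative assumptions.

```python
import numpy as np

def seasonal_difference(series, period=7):
    """Apply the (1 - B**period) operator: y(t) = x(t) - x(t - period),
    where B is the delay operator with Bx(t) = x(t - 1)."""
    x = np.asarray(series, dtype=float)
    return x[period:] - x[:-period]

# A purely weekly pattern is removed completely by weekly differencing:
week = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
x = np.array(week * 3)                 # three identical weeks
y = seasonal_difference(x, period=7)   # all zeros
```

The differenced series loses its first `period` observations, which must be accounted for when preparing training data.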
NEURAL NETWORKS FOR DATA QUALITY MONITORING OF TIME SERIES