we expect more attention to them will be given.
The only previously reported tool we know of
for generating incomplete data is that of Francois
and Leray (Francois and Leray, 2007), who use
Bayesian networks (BNs) to generate artificial
data with missing values. Unfortunately, their tool
supports only MCAR and restricted forms of MAR
incompleteness and cannot produce NMAR data.
As Francois and Leray point out, generating missing
data in any of these forms can be useful for generic
software testing, beyond machine learning research.
The structure of our paper is as follows. Section 2
describes the three absent data mechanisms and
introduces our nomenclature for them. In Section 3
we present a Backus-Naur Form (BNF; Backus and
Naur, 1960) grammar for scripting DataZapper. In
Section 4 we present the details of DataZapper, in-
cluding data formats in Section 4.1 and an overview
of how it works in Section 4.2. Section 5 illustrates
DataZapper’s use in an experimental setting.
2 ABSENT DATA MECHANISMS
A dataset is a matrix in which rows represent the cases
(joint samples) and columns represent variables mea-
sured for each case. Ideally, a dataset has all the cells
filled—i.e., it is a complete data set. However, most
real datasets have some values unobserved—i.e., they
are incomplete.
As we mentioned, Rubin (Rubin, 1976) intro-
duced and named three types of missing data mech-
anisms. We shall now motivate the adoption of new
names for these. First, we prefer to talk of “absent
data” rather than “missing data”, for the simple but
sufficient reason that “absent” has a natural nominal
form, “absence”, while “missing” leads to the awk-
ward neologism “missingness”. More significantly,
two of Rubin’s labels are clearly inadequately de-
scriptive of the mechanisms involved:
Missing Completely at Random (MCAR): as the
absence of values is independent of all variable
values, including the value for this particular cell,
this label is appropriate. We therefore keep its
form and call these cases absent completely at
random (ACAR).
Missing at Random (MAR): the absence of these
values may depend arbitrarily upon the values of
other variables. In the extreme case it is not
random at all, but a deterministic function of
those values. Hence, we prefer calling them
absent under dependence (AUD).
Not Missing at Random (NMAR): the natural
reading of this phrase is as the negation of
the second kind of “missingness”, which would
be entirely wrong. This case is simply a
generalization of AUD, allowing the absence of data
to depend also upon the actual value which has
failed to be measured. Hence, we have absent
under self-dependence (AUSD).
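The three mechanisms can also be stated compactly in probabilistic terms. The notation below is our own assumption and not taken from the text above: R is the indicator of which cells are absent, X_obs the values that remain observed, and X_abs the values that go unmeasured. It follows the usual formulation of Rubin's mechanisms and is intended only as a sketch.

```latex
\begin{align*}
\text{ACAR:}\quad & P(R \mid X_{\mathrm{obs}}, X_{\mathrm{abs}}) = P(R) \\
\text{AUD:}\quad  & P(R \mid X_{\mathrm{obs}}, X_{\mathrm{abs}}) = P(R \mid X_{\mathrm{obs}}) \\
\text{AUSD:}\quad & P(R \mid X_{\mathrm{obs}}, X_{\mathrm{abs}}) \neq P(R \mid X_{\mathrm{obs}})
  \ \text{for some values of } X_{\mathrm{abs}}
\end{align*}
```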
We submit that the most common case in real data
is the case most commonly ignored, AUSD, where
whether a value goes unmeasured depends both on
the values of some other variables and on the absent
value itself, as when wealthy lawyers hide their wealth.
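The difference between the three mechanisms can be illustrated with a minimal Python sketch. This is not DataZapper's implementation: the function make_absent, the token "?", and the toy variables age and income are all hypothetical names introduced for illustration. The only thing that changes between the mechanisms is which variable, if any, the absence is conditioned on.

```python
import random

ABSENT = "?"  # hypothetical absence token, chosen only for this sketch


def make_absent(rows, target, prob, depends_on=None, low=None, high=None, seed=0):
    """Return a copy of rows in which `target` is replaced by ABSENT with
    probability `prob` in every row where `depends_on` lies in [low, high].

    depends_on is None      -> ACAR: absence ignores all values
    depends_on != target    -> AUD: absence depends on another variable
    depends_on == target    -> AUSD: absence depends on the removed value itself
    """
    rng = random.Random(seed)
    out = []
    for row in rows:
        new_row = dict(row)
        # The condition is always tested on the original, fully observed row.
        triggered = depends_on is None or low <= row[depends_on] <= high
        if triggered and rng.random() < prob:
            new_row[target] = ABSENT
        out.append(new_row)
    return out


# Toy complete dataset; names and value ranges are made up for illustration.
data_rng = random.Random(42)
rows = [{"age": data_rng.randint(20, 70), "income": data_rng.randint(20, 200)}
        for _ in range(1000)]

acar = make_absent(rows, "income", 0.3)
aud = make_absent(rows, "income", 0.9, depends_on="age", low=50, high=70)
ausd = make_absent(rows, "income", 0.9, depends_on="income", low=150, high=200)
```

In the AUSD case the high incomes that trigger the absence are exactly the values that disappear, so the dependence cannot be recovered from the observed data alone; this mirrors the wealthy-lawyer example.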
3 SCRIPTING DATAZAPPER
How the data should be made absent is specified
in a simple scripting language, whose
BNF grammar is shown in Figure 4. These rules are
applied to a dataset file to generate a new dataset file
with some observed values replaced by a token in-
dicating absence. The basic form of a sentence is
that of an “if... then...” production rule. The an-
tecedent describes the dependencies that absence has
on variables and values in the system, while the con-
sequent lists the variables that take absent values on
these conditions and with what probability. If the an-
tecedent is empty, then the absent data generation is
unconditional—i.e., the data are ACAR in so far as
this production rule is concerned. If the consequent
lists no variables, then the absence mechanism is applied
to all variables in the dataset. When the data are AUD or
AUSD, the antecedent grammar rule specifies which
variable(s) the absence depends upon and for what
values or value ranges. The effects of the script rules
are cumulative. The result is a language in which
any strength of dependence upon any set of variables
can be specified, and such dependencies may be com-
<m-statement> ::= if <antecedent> then <consequent>
<antecedent> ::= <condition>*
<condition> ::= <variable> in <range>
<variable> ::= alpha alphanum*
<range> ::= [ <value>, <value> ]
<value> ::= alpha alphanum* | number | symbol
<consequent> ::= ( <prob> ) <variable>*
<prob> ::= number
Figure 4: BNF grammar for generating absent data.
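To make the grammar of Figure 4 concrete, the following Python sketch parses rules of that shape and applies them to a small dataset. It is not DataZapper's code, and it rests on several assumptions of ours: the names parse_rule and apply_rules and the token "?" are hypothetical, range bounds are treated as numbers only (the grammar also allows symbols), each rule is written on one line, each listed variable is blanked by an independent draw, and conditions are tested against the original values so that the effects of several rules accumulate.

```python
import random
import re

ABSENT = "?"  # hypothetical absence token, chosen only for this sketch
# Brackets, parentheses and commas are single tokens; anything else is a word.
TOKEN = re.compile(r"[\[\](),]|[^\s\[\](),]+")


def parse_rule(text):
    """Parse one 'if <antecedent> then <consequent>' rule into a dict."""
    toks = TOKEN.findall(text)
    if not toks or toks[0] != "if" or "then" not in toks:
        raise ValueError("expected: if <antecedent> then <consequent>")
    split = toks.index("then")
    ante, cons = toks[1:split], toks[split + 1:]

    # <antecedent> ::= <condition>*, each '<variable> in [ <value>, <value> ]'
    conditions = []
    while ante:
        cond, ante = ante[:7], ante[7:]
        if len(cond) != 7 or [cond[1], cond[2], cond[4], cond[6]] != ["in", "[", ",", "]"]:
            raise ValueError("expected: <variable> in [ <value>, <value> ]")
        conditions.append((cond[0], float(cond[3]), float(cond[5])))

    # <consequent> ::= ( <prob> ) <variable>*
    if len(cons) < 3 or cons[0] != "(" or cons[2] != ")":
        raise ValueError("expected: ( <prob> ) <variable>*")
    return {"conditions": conditions, "prob": float(cons[1]), "targets": cons[3:]}


def apply_rules(rows, rules, seed=0):
    """Apply every rule in turn; absences from earlier rules are kept."""
    rng = random.Random(seed)
    out = [dict(row) for row in rows]
    for rule in rules:
        for original, new_row in zip(rows, out):
            # An empty antecedent makes all() true, i.e. an unconditional rule.
            if all(lo <= original[v] <= hi for v, lo, hi in rule["conditions"]):
                # No variables listed: the rule applies to every variable.
                for var in rule["targets"] or list(original):
                    if rng.random() < rule["prob"]:
                        new_row[var] = ABSENT
    return out


# Hypothetical script lines, one per mechanism.
script = [
    "if then (0.05)",                             # ACAR on every variable
    "if age in [50, 70] then (0.9) income",       # AUD: income depends on age
    "if income in [150, 200] then (0.8) income",  # AUSD: income depends on itself
]
rules = [parse_rule(line) for line in script]
```

Calling apply_rules(rows, rules) on a list of dictionaries with numeric age and income fields returns a new list with some cells replaced by the absence token, leaving the input unchanged.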
DATAZAPPER: GENERATING INCOMPLETE DATASETS