only one identifying element, and composite
domains may involve multiple elements. For
example, male first name and phone number are
simple domains, whereas a full address domain
(street, city, state, zip code) may be a composite
domain. To preserve consistency, composite
domains have to be mangled as a whole unit. For
example, one instance of an address may be swapped
with another random but complete address. iSTDE
also sometimes subdivides a domain when
swapping needs to be constrained by the value of
some other element. For example, it partitions
gender-dependent domains into female and male
subsets, so the first name domain is partitioned into
male first names and female first names.
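The two rules above (a composite domain moves as one unit, and a partitioned domain is swapped only inside its own subset) can be illustrated with a minimal sketch. The function name swap_domain, the field names, and the sample records are all hypothetical and are not taken from iSTDE; the sketch shuffles values among records purely for illustration.

```python
import random

def swap_domain(records, fields, partition_by=None, seed=None):
    """Shuffle the values of `fields` among records, treating them as one unit.

    If `partition_by` is given, values are only exchanged between records
    that share the same value in that field (e.g. gender).
    """
    rng = random.Random(seed)

    # Group record indices by partition key (a single group if unpartitioned).
    groups = {}
    for index, record in enumerate(records):
        key = record[partition_by] if partition_by else None
        groups.setdefault(key, []).append(index)

    mangled = [dict(record) for record in records]
    for indices in groups.values():
        # Each unit is a complete tuple, so a composite value is never split.
        units = [tuple(records[i][f] for f in fields) for i in indices]
        rng.shuffle(units)
        for i, unit in zip(indices, units):
            mangled[i].update(zip(fields, unit))
    return mangled

# Made-up sample data, for illustration only.
ADDRESS = ("street", "city", "state", "zip")
patients = [
    {"first_name": "Maria", "gender": "F", "street": "1 Elm St",
     "city": "Logan", "state": "UT", "zip": "84321"},
    {"first_name": "Helen", "gender": "F", "street": "9 Oak Ave",
     "city": "Provo", "state": "UT", "zip": "84601"},
    {"first_name": "David", "gender": "M", "street": "4 Pine Rd",
     "city": "Boise", "state": "ID", "zip": "83702"},
    {"first_name": "Peter", "gender": "M", "street": "7 Ash Ln",
     "city": "Ogden", "state": "UT", "zip": "84401"},
]

with_new_addresses = swap_domain(patients, ADDRESS, seed=1)
mangled = swap_domain(with_new_addresses, ("first_name",),
                      partition_by="gender", seed=2)
```

After both passes every record still carries a complete address that existed in the input, and every first name stays within the subset for the record's gender.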
After selecting the PII domains, we build dictionaries
for them. These domain dictionaries are
data structures that consist of real domains and test
domains. Real domains are data slices that are built
from similar PII domains from across all databases
included in CHARM, not just one database. For
example, in the case of male first names, the
dictionary's real domain contains every male first
name entry that exists in any table of the temporary
databases. Test domains are populated by semi-random
shuffling of real domain entries. The term semi-random
indicates that an entry in a real domain may map to
the same entry in the test domain. Entries in test
domains are the newly assigned values for the real
data. In the next step of the
mangling process, we swap all the values of PII
domains in real data with the newly assigned test
values, using domain dictionaries that provide a
mapping from real values to test values. Once we
have mangled all the data, we then delete these
dictionaries so that no one can perform reverse
mapping to real data. Essentially, iSTDE deals with
three different types of data mangling.
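The dictionary-based procedure described above can be sketched as follows. This is only an illustration under stated assumptions: the function names build_domain_dictionary and mangle_column, the in-memory row format, and the sample names are hypothetical, and a plain shuffle stands in for the semi-random shuffling.

```python
import random

def build_domain_dictionary(real_values, seed=None):
    """Map every distinct real value to a test value.

    The test domain is a shuffle of the real domain, so an entry may
    occasionally map to itself (the "semi-random" property).
    """
    real_domain = sorted(set(real_values))
    test_domain = real_domain[:]
    random.Random(seed).shuffle(test_domain)
    return dict(zip(real_domain, test_domain))

def mangle_column(rows, column, dictionary):
    """Return copies of rows with `column` replaced by its test value."""
    return [{**row, column: dictionary[row[column]]} for row in rows]

# Two hypothetical databases that share the same PII domain.
db_a = [{"first_name": "John"}, {"first_name": "Omar"}]
db_b = [{"first_name": "Luis"}, {"first_name": "John"}]

# The real domain is collected across all databases, not just one.
dictionary = build_domain_dictionary(
    [row["first_name"] for row in db_a + db_b], seed=42)

mangled_a = mangle_column(db_a, "first_name", dictionary)
mangled_b = mangle_column(db_b, "first_name", dictionary)

# Discard the mapping so that it cannot be used for reverse lookup.
del dictionary
```

Because one dictionary serves every database, the same real value receives the same test value wherever it occurs, which keeps records linkable across the mangled databases.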
The first type is 1-to-1 logical domain dependency.
Two domains are said to be logically dependent
when they are semantically related to each other, so
that a change in one domain requires a corresponding
change in the other. Consider two domains, D1 and D2, which
have a 1-to-1 logical dependency between them but
have different data representations. When we swap a
value in one of the domains, a corresponding swap
must also be made in the second. More specifically,
if x, x' ∈ D1 and y, y' ∈ D2 are such that x ↔ y and
x' ↔ y', then if x is swapped with x', y must also be
swapped with y', and vice versa. For example,
consider two tables containing identical
demographic information about patients. One table
uses just one column to store birth dates, e.g.,
05/11/2009 for patient A, while another table uses
three columns to store the same birth date for patient
A, e.g., 05 as MM, 11 as DD, and 2009 as
YYYY. iSTDE ensures that the two tables maintain the
same logical dependency after mangling; that is, if
the birth date 05/11/2009 is swapped with some
other date 07/10/2007 in one table, iSTDE also
makes the same logical swap in the other table that
uses three columns to represent the birth dates.
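One way to keep the two representations in step is to translate both into a common canonical value and apply a single mapping to it. The sketch below assumes this approach; the helper names, column names (birth_date, mm, dd, yyyy), and the hard-coded mapping are hypothetical and only mirror the example dates in the text.

```python
from datetime import date

def parse_single(value):
    """Parse an MM/DD/YYYY string into a date."""
    month, day, year = (int(part) for part in value.split("/"))
    return date(year, month, day)

def format_single(d):
    """Format a date back into MM/DD/YYYY."""
    return f"{d.month:02d}/{d.day:02d}/{d.year:04d}"

def mangle_single(row, mapping):
    """Mangle a row that stores the birth date in one column."""
    new = mapping[parse_single(row["birth_date"])]
    return {**row, "birth_date": format_single(new)}

def mangle_split(row, mapping):
    """Mangle a row that stores the birth date in three columns."""
    old = date(int(row["yyyy"]), int(row["mm"]), int(row["dd"]))
    new = mapping[old]
    return {**row, "mm": f"{new.month:02d}", "dd": f"{new.day:02d}",
            "yyyy": f"{new.year:04d}"}

# One shared mapping on the canonical value drives both tables.
mapping = {
    date(2009, 5, 11): date(2007, 7, 10),
    date(2007, 7, 10): date(2009, 5, 11),
}

table_one_row = {"patient": "A", "birth_date": "05/11/2009"}
table_two_row = {"patient": "A", "mm": "05", "dd": "11", "yyyy": "2009"}

mangled_one = mangle_single(table_one_row, mapping)
mangled_two = mangle_split(table_two_row, mapping)
```

Both mangled rows now describe 07/10/2007 for patient A, each in its own layout.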
The second type of dependency in the iSTDE
mangling process is called data value dependency.
Two domains D1 and D2 are said to have a data
value dependency when for any single record that
uses values from both domains, there is a constraint
involving those values in these domains. Then, if
values in D1 are swapped, a random swap must also
be made in D2, but the original constraint must still
hold (if the original record satisfies that constraint).
More specifically, if x, x' ∈ D1 and y, y' ∈ D2 are such
that x ⊗ y, where ⊗ represents some constraint, then if
x is swapped with x', y can also be swapped with y'
as long as x' ⊗ y'. For example, a child's birth date
in any of the databases cannot be earlier than a
parent's birth date.
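A constrained swap of this kind can be sketched by filtering the candidate replacements through the constraint before choosing one at random. The function pick_dependent_value, the predicate parent_before_child, and the sample dates are hypothetical; the constraint used here (a parent is born before the child) is an illustrative assumption.

```python
import random
from datetime import date

def pick_dependent_value(x_old, y_old, x_new, candidates, constraint, rng):
    """Choose a replacement y' for y_old after x_old was swapped with x_new.

    If the original pair satisfied the constraint, only candidates that
    satisfy it together with x_new are eligible; otherwise any candidate is.
    """
    if not constraint(x_old, y_old):
        return rng.choice(candidates)
    valid = [y for y in candidates if constraint(x_new, y)]
    if not valid:
        raise ValueError("no candidate satisfies the constraint")
    return rng.choice(valid)

def parent_before_child(parent_birth, child_birth):
    """Illustrative constraint: the parent is born before the child."""
    return parent_birth < child_birth

rng = random.Random(7)
child_birth_dates = [date(1975, 3, 2), date(1999, 8, 21), date(2009, 5, 11)]

new_child_birth = pick_dependent_value(
    x_old=date(1980, 1, 15),      # original parent birth date
    y_old=date(2009, 5, 11),      # original child birth date
    x_new=date(1990, 6, 30),      # mangled parent birth date
    candidates=child_birth_dates,
    constraint=parent_before_child,
    rng=rng,
)
```

The choice remains random within the valid subset, so the mangled record keeps the relationship without reproducing the original value pair.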
The third type of data mangling relates to the
mangling of computed fields and partial computed
fields. These two types of fields are considered
dependent and are derived from some other fields.
For example, a full name field can be a computed
field as it is derived from a first name and last name.
When iSTDE mangles the first name and last name,
it also re-computes the full name to keep the names
consistent. Partial computed fields are those fields
that have partly independent values and partly
computed values. For example, a contact name field
can contain a brother's name. Because two brothers
may share the same last name, mangling the last name
also requires re-computing that part of the contact
name field.
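The re-computation of computed and partial computed fields can be sketched as below. The record layout, the function name mangle_person, and the rule that a contact's last name is treated as derived only when it equals the patient's original last name are assumptions made for illustration.

```python
def mangle_person(record, first_map, last_map):
    """Mangle name fields and re-compute the fields derived from them."""
    old_last = record["last_name"]
    new_first = first_map[record["first_name"]]
    new_last = last_map[old_last]

    mangled = dict(record, first_name=new_first, last_name=new_last)

    # Computed field: fully derived from first and last name.
    mangled["full_name"] = f"{new_first} {new_last}"

    # Partial computed field: only the last-name part depends on the patient.
    # The independent part is left untouched in this sketch.
    contact_first, _, contact_last = record["contact_name"].rpartition(" ")
    if contact_last == old_last:
        mangled["contact_name"] = f"{contact_first} {new_last}".strip()
    return mangled

# Made-up sample record and mappings, for illustration only.
patient = {
    "first_name": "John",
    "last_name": "Smith",
    "full_name": "John Smith",
    "contact_name": "Tom Smith",
}
result = mangle_person(patient, {"John": "Omar"}, {"Smith": "Reyes"})
```

Here the full name becomes "Omar Reyes" and the contact name becomes "Tom Reyes", while a contact with an unrelated last name would be left as it was.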
4.5 Transferring Mangled Data
Once mangling of data is complete, the fifth step in
the entire process is the automatic transfer of the de-
identified test data to the user-specified unprotected
environment. To do this, iSTDE creates a dump of
all the temporary databases, transfers them via a
secure copy to the unprotected environment, and
executes remote commands to restore those dumps
in databases in the unprotected environment. A
significant challenge while transferring the test data
was to manage the access controls and firewalls.
iSTDE uses a number of built-in scripts in
the confidential environment to manage these network
transfer obstacles.
SEMANTIC BASED TEST DATA EXTRACTION FOR INTEGRATED SYSTEMS (iSTDE)