Novel Approach to De-Identify Relational Healthcare Databases at Rest:

A De-Identiﬁcation of Key Data Approach

Yazeed Ayasra, Mohammad Ababneh and Hazem Qattous

Princess Sumaya University for Technology, Amman, Jordan

yazeed.ayasra@gmail.com, {m.ababneh, h.qattous}@psut.edu.jo

Keywords:

Health Information Security, Relational Database Security, Data Anonymization, ePHI Security, HIPAA

Compliance.

Abstract:

Health information systems are widely used in the healthcare sector, and migration to cloud-based applica-

tions continues to be prominent in recent practices. Various legislations were issued by different countries to

ensure the conﬁdentiality of personal health information, introducing liabilities, and imposing penalties and

ﬁnes to organizations in violation. This drives organizations to deploy signiﬁcant investments in information

security to safeguard various health information systems. The healthcare industry has experienced the second

highest data breaches compared to other industries at 24.5% of the total data breaches in the United States

between 2005 and 2023. Database layer vulnerabilities remain among the most exploited, resulting in attacks that cause devastating confidentiality breaches of electronic personal health information (ePHI). The framework suggested in this work relies on de-identification using the Health Insurance Portability and Accountability Act

(HIPAA) safe harbor method of removing 18 identifying attributes from the data in its resting state. To achieve

this, the work proposes 7 rules that allow the migration of health information system databases to the suggested

framework structure to maintain a de-identiﬁed state of the database at rest. This is achieved through the seg-

regation of identifying information in different tables based on their identiﬁcation power and frequency of

use while structuring them in a hierarchical manner where tables refer to the next or previous levels through

encrypted foreign keys. The paper then successfully transforms a typical EHR system database schema into a de-identified version of itself that abides by the 7 rules suggested by this work.

1 INTRODUCTION

To tackle the various issues of ePHI security,

the US Congress acted and passed the Health Insur-

ance Portability and Accountability Act (HIPAA) in

1996. The main objective of HIPAA was to estab-

lish a uniform set of standards for the disclosure of

health data by healthcare providers. The act also sets

standards for various other aspects including infor-

mation security, privacy, fraud prevention, electronic

data exchange, healthcare access, revenue, and in-

surance (Miller and Payne, 2016). HIPAA imposes

penalties and ﬁnes on organizations in breach of its

security and privacy rules reaching up to 1.5M USD

per incident. This drives organizations to deploy sig-

niﬁcant investments in information security to safe-

guard various health information systems. One way

to perform data sharing while abiding by HIPAA regulations is the production of de-identified datasets.

https://orcid.org/0000-0001-9283-7251

https://orcid.org/0000-0002-6431-2201

https://orcid.org/0000-0002-8468-1602

De-identified datasets are defined as health information for which there is no reasonable basis to believe that it can be used to identify an individual (Caplan, 2003). HIPAA requires validation of de-identified datasets through one of two methods: expert determination or safe harbor. Expert determination requires a formal inspection and approval by a

qualiﬁed subject matter expert, while the safe harbor

method requires the removal of 18 personal health in-

formation identiﬁers. Limited datasets are also widely

used for research purposes. Limited datasets are lim-

ited sets of identiﬁable patient information that are

considered personal health information because of the

possibility of re-identification as per the HIPAA privacy rule (Caplan, 2003). Most recent cloud application technologies rely on the three-tier approach with an SQL server as the database server. Three-tier setups improve performance and limit the number of attacks on databases due to their segregation from the application layer (Singh, 2024). Although

Ayasra, Y., Ababneh, M. and Qattous, H.

Novel Approach to De-Identify Relational Healthcare Databases at Rest: A De-Identiﬁcation of Key Data Approach.

DOI: 10.5220/0013302000003911

Paper published under CC license (CC BY-NC-ND 4.0)

In Proceedings of the 18th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSTEC 2025) - Volume 2: HEALTHINF, pages 207-218

ISBN: 978-989-758-731-3; ISSN: 2184-4305

this architecture provides an extra layer of protection

for database servers, it is still prone and vulnerable

to all common database server attacks through the

application layer and directly on the database layer

(OWASP, 2024). Piggy-backed queries, where an attacker attempts to append additional statements to pre-existing SQL queries, are very common and constitute most SQL attacks (Sai Lekshmi and Devipriya, 2017). SQL injections constitute 15% of attacks against cloud-based applications (Unlu and Bicakci, 2010). Such attacks and others directly contribute to risks against information integrity, availability, and confidentiality. In the healthcare context,

data leaks result in breaches of conﬁdentiality and pri-

vacy of electronic personal health information expos-

ing patients and organizations to signiﬁcant liabili-

ties. Newer DBMS versions support a variety of fea-

tures to prevent common attacks such as stored proce-

dures that can help prevent SQL injection attacks but

only when stored procedures are implemented cor-

rectly (Patel et al., 2020). Although such features

would increase database security, vulnerabilities are

still present and database information is still acces-

sible to internal resources managing them. In the

United States, 35,167 known data breaches have been

reported from 2005 to 2023 (Rights, 2023).

Figure 1: Proposed methodology by (Rights, 2023).

The healthcare industry has experienced the sec-

ond most data breaches compared to other industries

at 24.5% of the total data breaches (Rights, 2023).

From 2020 to 2022, the industry suffered from 1,046

data breaches in the United States alone, costing an

average of $6.45 million per incident (Enterprise,

2018).

Anonymization and de-identiﬁcation of data is

one solution adopted by the healthcare community

to produce datasets for use in the research communi-

ties. De-identiﬁed datasets enable information shar-

ing for research and public health purposes but can-

not be used to protect data at rest for various health

information systems. This is due to the integral need

for personal health information identiﬁcation for con-

tinuity of care purposes. This work explores a method

to de-identify data at rest and allow re-identiﬁcation at

the application layer.

Various security methods, models and new tech-

nologies are widely used to protect applications and

databases from unauthorized access or data leaks.

Mandatory Access Control (MAC) is an access con-

trol model enabling access to information based on

conﬁdentiality of information and clearance level.

Encryption converts information to ciphertext al-

lowing access to authorized users and applications

in custody of the key to decipher the code and

access the original information. Blockchain is a

distributed database composed of blocks accessed

through the acquisition of cryptographic hashes to reveal stored information. The framework proposed in this work attempts to utilize MAC, encryption, and a blockchain-like structure to de-identify electronic personal health information in relational databases.

The usage of cloud-based platforms has expo-

nentially increased in the past few years due to the

increasing need for medical data available online

(Chhabra et al., 2022). Various types of cloud-based

platforms were required for a variety of clinical and

medical applications such as contact tracing, labora-

tory test results, travel health declarations, clinical en-

counters, and other public health tools. This resulted

in a signiﬁcant increase in the amount of electroni-

cally identiﬁable personal health information (ePHI)

available online that could be vulnerable to various

cloud-based application attacks. Such high availabil-

ity of health information exposes organizations to sig-

niﬁcant implications and liabilities due to conﬁden-

tiality breaches of electronic personal health informa-

tion (ePHI) resulting in violations of health informa-

tion privacy policies and laws in various countries.

Four out of the top ﬁve vulnerabilities published

by the Open Worldwide Application Security Project (OWASP) can directly impact the database

layer and cause unauthorized access to stored infor-

mation revealing conﬁdential information. Broken

access control, injection, insecure design, and secu-

rity misconﬁgurations can all lead to privacy and con-

ﬁdentiality breaches with devastating impacts on the

healthcare sector. Best practices such as parameter-

ized queries, stored procedures, and other solutions

are still vulnerable to a variety of attacks in certain

setups (OWASP, 2024).

The emerging needs of big data have rendered non-relational (NoSQL) databases a widely used solution

for data storage and retrieval. However, relational

(SQL) databases remain the most widely used solu-

tion for data storage and retrieval in various applica-

tions (Capris et al., 2022). Those databases remain

prone to various attacks resulting in conﬁdentiality

breaches for electronic personal health information in

HEALTHINF 2025 - 18th International Conference on Health Informatics

the healthcare sector.

SQL injections, physical attacks, access control

breaches, and other database server attacks could the-

oretically have less impact if released information is

de-identiﬁed. Achieving a de-identiﬁed storage sta-

tus using the safe harbor method by eliminating ac-

cess to the 18 known identiﬁers of electronic health

records in databases at rest would prevent policy vi-

olations. This would lessen liabilities for healthcare

organizations and help reduce patients’ conﬁdential-

ity breaches while minimally impacting application

performance. Such a storage framework is also immune to insider attacks and would offer flexibility in producing limited and de-identified datasets.

2 LITERATURE REVIEW

This section explores various modalities designed to

increase the immunity of relational databases against

common relational database attacks. It covers a vari-

ety of solutions that attempt to solve the problem from

different angles.

To produce de-identiﬁed and limited datasets,

(Erdal et al., 2012) discusses an original framework that uses structured data in relational database-based health information systems. The process performs extract, transform, and load (ETL) to produce a limited

dataset that can later be processed into a de-identiﬁed

dataset.

The objective of this study is to produce datasets

that can be used for research purposes that are com-

pliant with HIPAA and the local IRB. The researchers

took an interesting approach that relied on the manual

removal of HIPAA-protected data attributes through a

variety of methods. To isolate patient identifiers, the researchers produced a multi-tiered mapping between patient identifiers and pseudo-identifiers with a layer of one-way hashing that prevents re-identification through classical SQL queries. For patients more than 90 years of age, the researchers stored their age in a separate table while recording a maximum age of 89 in the limited and de-identified datasets. To produce de-identified datasets, the researchers added a random time increment to disguise dates. The paper later demonstrates the viability of the de-identification framework by running it on 20 million records to assess its efficiency.

The comparison between the original dataset and the

produced dataset highlighted a signiﬁcant increase

in CPU processing time (averaging 15 seconds per

execution) and a better performance when generat-

ing limited datasets in comparison with de-identiﬁed

datasets. This framework appears to be viable to re-

place the common manual de-identiﬁcation processes

required to produce limited and de-identiﬁed datasets

for reporting and research purposes. However, such

an approach does not offer any layer of protection to

the original datasets, which are still at risk of common

relational database vulnerabilities. Furthermore, the

model does not offer a real-time reﬂection of the data

available as it would require a signiﬁcant overhead if

deployed in a real-time production environment. The

proposed framework in this work offers a solution that

would minimally tax the application when protecting

health information systems databases.
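The two masking steps described above (capping recorded ages and shifting dates by a random increment) can be illustrated with a minimal Python sketch. This is not the implementation of (Erdal et al., 2012). The function names, the use of a single random offset per patient (so that the intervals between that patient's events are preserved), and the one-year offset range are assumptions made purely for illustration.

```python
import random
from datetime import date, timedelta

MAX_REPORTED_AGE = 89  # maximum age recorded in the released datasets


def cap_age(age: int) -> int:
    """Report ages above the cap as the cap itself."""
    return min(age, MAX_REPORTED_AGE)


def patient_offset(rng: random.Random, max_days: int = 365) -> timedelta:
    """Assumption: one random offset per patient, between 1 and max_days days."""
    return timedelta(days=rng.randint(1, max_days))


def shift_dates(dates: list[date], offset: timedelta) -> list[date]:
    """Shift every date of one patient by the same offset."""
    return [d + offset for d in dates]


rng = random.Random(42)  # seeded only to make the sketch repeatable
offset = patient_offset(rng)
visits = [date(2023, 1, 10), date(2023, 2, 9)]
shifted = shift_dates(visits, offset)
```

Because the same offset is applied to all of a patient's dates in this sketch, the 30-day gap between the two visits survives the shift even though neither original date is released.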

The work presented in (Lin et al., 2016) at-

tempted to reach k-anonymity in relational databases

through an efﬁcient algorithm. To implement this,

records go through an anonymization algorithm that transforms data points into a more generic version of themselves, so that each record is indistinguishable from at least k-1 other rows.

The authors suggested an original framework that

performed anonymization through three consecutive

modules. The pre-processing module was designed

to encode transactions as bitmaps that are sorted using the Gray order to bring similar transactions close together, and to classify them into several segments based on similarity for efficient anonymization. The Travelling Salesman Problem (TSP) module was designed to find a cyclical loop between the transactions in each segment. Segmentation makes the TSP module more efficient because it searches for the loop among fewer transactions at a time. The anonymization module achieves k-anonymity through the replacement of each class with its center class as calculated by the module. This

results in data loss, which is necessary to achieve

anonymization.

The paper also reports an experimental evaluation that anonymized five real-world datasets and one synthetic dataset. The results showed a significant increase in data loss as the value of (k) increased, which is consistent with the nature of k-anonymization. Although the researchers present a reasonable algorithm that can achieve anonymization, they report significant data loss, as anticipated with such an approach. Such data loss can

be intolerable in electronic health records or medical

records systems.
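The trade-off between (k) and data loss can be seen in a small sketch. The code below is not the Gray-order and TSP algorithm of (Lin et al., 2016). It is a simpler, assumed example that checks k-anonymity over a set of quasi-identifiers and generalizes exact ages into ten-year bands. The records, column names, and helper functions are hypothetical.

```python
from collections import Counter


def is_k_anonymous(rows, quasi_ids, k):
    """True if every combination of quasi-identifier values occurs at least k times."""
    groups = Counter(tuple(row[q] for q in quasi_ids) for row in rows)
    return all(count >= k for count in groups.values())


def generalize_age(row, width=10):
    """Replace an exact age with a band such as '30-39' (precision is lost)."""
    low = (row["age"] // width) * width
    return {**row, "age": f"{low}-{low + width - 1}"}


# Hypothetical records: age and zip are treated as quasi-identifiers.
rows = [
    {"age": 31, "zip": "111", "dx": "flu"},
    {"age": 35, "zip": "111", "dx": "asthma"},
    {"age": 38, "zip": "111", "dx": "flu"},
    {"age": 52, "zip": "222", "dx": "diabetes"},
    {"age": 57, "zip": "222", "dx": "flu"},
]
generalized = [generalize_age(r) for r in rows]
```

With these records, the exact ages are not 2-anonymous, while the banded ages are. Reaching k = 3 would require further generalization or suppression, which mirrors the growing data loss described above.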

Authors of (El Emam, 2010) suggested a method

to measure the re-identiﬁability of health information

based on a conceptual ﬁve-level model of the identiﬁ-

ability continuum as shown in Figure 2. They concep-

tualize that datasets can reside in gray zones between

identiﬁable and non-identiﬁable statuses. The authors

classiﬁed health information datasets into ﬁve con-

ceptual categories that are ordinally classiﬁed based

on the possibility of re-identiﬁcation. Level 1, which

presents datasets that are inclusive of readily identi-

ﬁable data such as names, social security numbers,

and addresses, is at the base of their model with a

higher risk of re-identiﬁcation. Level 2 constitutes

datasets with reversibly or irreversibly masked identifiers but accessible quasi-identifiers (e.g., dates, zip codes, date of birth). Exposed data, or level 3 as

described by the suggested continuum, is masked

data that went through an obfuscation process for

quasi-identiﬁers rendering the dataset more resistant

to re-identiﬁcation but still considered as a high-risk

dataset. This is because the custodian did not assess

re-identiﬁability at this level. Level 4 is considered by

the authors as the ﬁrst level to host true de-identiﬁed

data. Aggregate extracts of the data were marked as

level 5, which is the highest form of de-identiﬁed data

as per the continuum. The paper argues that measuring health data identifiability is possible based on determining the risk associated with the data requester. The authors go on to classify risks into three categories. Prosecutor risk is relevant when an adversary (a family member or knowledgeable personnel) attempts to re-identify data based on previous knowledge of the subject and the fact that the subject exists in the dataset. Journalist risk is relevant when an adversary attempts to re-identify a subject based on background knowledge about the subject, without knowing whether the subject exists in the dataset. Finally, marketer risk represents attempts at mass re-identification by comparing datasets with each other, such as comparing a de-identified health dataset with a national registry of citizens, which could result in the re-identification of a subset of the subjects in the dataset. The article argues that once custodians understand the type of risk associated with each type of dataset recipient, they can determine a threshold on the identifiability continuum for the types of data at risk of re-identification.

Although the authors' view that the re-identifiability of a dataset lies on a continuum is valid, the suggestion that the risks are closely associated with the target recipients of data is not conclusive. Health information is considered sensitive data that is highly attractive to attackers. Therefore, it is likely to end up with parties interested in exploiting it when the dataset is not properly de-identified. The framework proposed in this

work is designed to enable back-end applications to

produce all levels of datasets on the continuum with

the capacity to aggregate information into producing

level 5 datasets.

Figure 2: Identifiability Continuum (El Emam, 2010).

The authors of (Omran et al., 2009) discussed the importance of health information exchange and the various recent methods adopted to allow it among various health service providers. The paper also highlights that the higher the availability of healthcare data, the more prone it is to confidentiality and integrity breaches. The healthcare sector benefits from the utilization of web services to promote interoperability, but this significantly increases the possibility of attacks that would compromise the confidentiality and privacy of patients.

The paper discussed security mechanisms and issues in both databases and web services. The first database security measure discussed was access control. The paper compared content-based access control, which only allows access to pre-tabulated views, the RBAC model, and System R access control. The authors then discuss k-anonymity, which aims to abstract data and release a de-identified version. All subjects and attributes that can result in identification are omitted, leaving a de-identified version of the data that could be shared with minimal confidentiality and privacy risks.

Although there have been successful attempts

to re-identify patients from de-identiﬁed data, other

measures could be taken by policymakers and

database architects to minimize such inference. An

example of that is changing quantiﬁable attributes

such as age to nominal and ordinal data such as an

age group. K-anonymity solves the problem of health information exchange to a great extent, but it does not address security in the case of database vulnerabilities of healthcare cloud applications, and it results in significant data loss. The authors also discussed database encryption and Hippocratic databases. Both approaches increase the level of security, although database encryption methods have a significant performance impact.

The study continued to explain web services secu-

rity and suggested that network security should also

be addressed. As the scope of this work focuses on

dataset security in relational databases against suc-

cessful attacks, our focus was to study the different

database protection approaches in comparison to the

model proposed in this work.

Authors of (Omran et al., 2009) conducted a case study of openEHR security and architecture.

openEHR is a widely used open-source electronic

health records system. The paper discusses some of

the security features of the application such as the

segregation of demographics from clinical and ad-

ministrative data, version control, and access control

among others. The openEHR approach to the seg-

regation of data is to achieve anonymity by elimi-

nating any direct clues to identify the patients in the

EHR database. The application enforces the storage

of an EHR ID (ehr_id) that serves as the foreign key

between two databases on different servers. This is

stored in a cross-reference database that could be en-

crypted. This approach attempts to achieve a simi-

lar goal to this work. However, the approach pre-

sented in this work offers a more efﬁcient and cost-

effective solution since the storage of information is

done on one database server. It also allows differ-

ent levels of protection based on the sensitivity of the

various data attributes. Due to the hierarchical nature of the proposed framework, software adopting

this framework would be capable of producing lim-

ited or de-identiﬁed datasets on demand without sig-

niﬁcant overhead. The suggested framework enables

security policies and role-based access control due to

the multi-tiered classiﬁcation of sensitive data.

The work presented in (Avireddy et al., 2012) discusses the utilization of randomized encryption for protection against SQL injection attacks. The paper presented the risks of SQL injection attacks and their common incidence in cloud applications. It aimed to develop an algorithm that uses randomized encryption to minimize the confidentiality breach risk caused by SQL injection attacks, and went on to develop an application that adopts the algorithm to validate its resistance to SQL injection attacks.

The authors focused on SQL injection attacks

caused by poor validation on the application tier in

a three-tier cloud application setup. Such improper

validation might cause SQL statements to be passed

toward the database server, causing potential extrac-

tion of conﬁdential data, taking control, or disturbing

data integrity. SQL injection attacks using tautology

are very common in applications that do not enforce

proper front-end and back-end validation. One example of that would be an attacker passing an SQL statement to extract data or alter the original query. Another example of such attacks is the usage of incorrect queries to return errors from the database server. Attackers usually use

this method to infer information about the database

structure which could later be used during the con-

struction of malicious SQL statements.
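A tautology attack of the kind described above can be reproduced in a few lines. The sketch below uses Python's built-in `sqlite3` module instead of a production database server so that it is self-contained. The `patients` table, its contents, and the two lookup functions are hypothetical. It contrasts a query built by string concatenation with a parameterized query, which is a standard mitigation and not the Random4 approach discussed in this section.

```python
import sqlite3

# In-memory database with a hypothetical table, for demonstration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patients (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO patients (name) VALUES (?)", [("Alice",), ("Bob",)])


def find_unsafe(name):
    """Vulnerable: user input is concatenated directly into the SQL text."""
    query = "SELECT name FROM patients WHERE name = '" + name + "'"
    return conn.execute(query).fetchall()


def find_safe(name):
    """Parameterized: the input is bound as a value and never parsed as SQL."""
    return conn.execute(
        "SELECT name FROM patients WHERE name = ?", (name,)
    ).fetchall()


payload = "x' OR '1'='1"        # tautology: the WHERE clause becomes always true
leaked = find_unsafe(payload)    # every row in the table is returned
blocked = find_safe(payload)     # no patient has this literal name
```

The concatenated query returns both rows because the injected condition is always true, whereas the parameterized query treats the whole payload as a name and returns nothing.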

The Random4 algorithm is what the authors pro-

pose as their contribution to this paper. It constitutes

a method that is based on randomization to convert

input into an encrypted salt. The algorithm chooses

speciﬁc characters that could be used in salting any

front-end input before passing it back to the database

engine. Each character in the input is later mapped

into one of 4 probable characters based on the pre-

deﬁned lookup table relying on the character mapping

technique they chose to implement.

The paper suggests that encrypted input should be

stored within the database which would signiﬁcantly

decrease the impact of SQL injection attacks. They

attempted to develop an application written in C# to

encrypt incoming inputs to provide examples. The

proposed model would prevent piggy-backed queries

since passed input is changed and would not be mali-

cious once received on the SQL server side. The ini-

tial explanation of the Random4 algorithm described

that user input would be salted before being saved in

the database. However, the authors go on to describe that the various database attributes also fall under this technique, making it harder for attackers to uncover the structure of the database.

The Random4 algorithm was tested on 5 different applications, showing a significant reduction in SQL injection attacks when tautology is used, which is justifiable given the nature of salting the input. However, analyzing the algorithm proposed in (Avireddy et al., 2012), it is clear that encrypting the input to SQL databases would result in a significant performance hit due to longer and inconsistent strings, as well as the overhead produced by encrypting and decrypting all input to and output from the database.

The work presented in (Oksuz, 2022) attempted

a different approach to achieve a similar goal of

database anonymization in electronic health records

databases. The proposed model relied on blockchain

to store identiﬁable information. The paper explains

a novel approach to anonymize patient records in an

EHR database through the storage of medical records

by a centralized authority in an off-chain database

while referencing a hash accessible in the blockchain.

Referentiality of records is lost and all encounters

have associated blocks in the chain accessible through

a hash provided by the patient. The researchers

highlight that although records can only be identi-

ﬁed through the presentation of a pseudo-identiﬁer

through the patient, institutions are still capable of

utilizing anonymized data for analysis and research

purposes.

The model in (Oksuz, 2022) allows central

providers trusted on a chain to add and create new

blocks that can be associated with individual encoun-

ters in the chain. Untrusted entities requesting ac-

cess to clinical encounters would have to use prede-

termined random hashes to access various blocks as-

sociated with the same record or clinical signiﬁcance

to later consume hashes to query off-chain databases.

This requires patients to carry specific keys to all their encounters, such as vaccination records, and requesting entities would not be able to verify the authenticity of a record unless provided with at least two blocks about the same patient pseudo-identifier.

This model is the most similar, in terms of functionality and results, to the framework proposed in this work. Because the authors did not introduce solutions for some quasi-identifiers such as encounter dates, it achieves a resting state of the database

that is considered a limited dataset by HIPAA deﬁni-

tion, with pseudo identiﬁers available in a blockchain

only accessible through the respective block hashes.

However, in contrast to the work in this paper,

the model proposed by (Oksuz, 2022) requires a

resource-intensive setup and a complex integration of

off-chain and in-chain records, and due to the nature

of blockchain technologies, the performance impact

can be signiﬁcant (Koushik et al., 2019). Further-

more, the model relies heavily on patients present-

ing access keys for all entities to access their records.

This can be counterproductive in an emergency room

setting where patients are unconscious, or hashes are

lost. The model also did not take into consideration

the necessity of case progression and timeline analy-

sis in research applications, which is very crucial in

many types of clinical and public health studies and

is pivotal to achieving continuity of care.

None of the above literature adopts an approach

that encrypts foreign keys to de-identify data in

a tiered database instance, which is the proposed

method in this research. All relevant work focused

on using anonymization when data is being extracted

for research and statistical analysis.

3 FRAMEWORK DESIGN

This section describes the proposed framework for achieving a secure de-identified database, enabling the transformation of relational healthcare databases with electronic personal health information into true de-identified databases at rest.

3.1 Determination of Identiﬁers and

Quasi-Identiﬁers in an EHR

Database

This section suggests a categorization matrix for com-

mon electronic health records system attributes to de-

termine their protection level. The work scores the

sensitivity level of each attribute on the identiﬁability

continuum proposed by (El Emam, 2010) and deter-

mines the level of accessibility of each attribute. This

is achieved by ranking attributes based on the identi-

ﬁability continuum and by classifying the frequency

of the attribute appearance on various EHR screens.

Figure 3 proposes a categorization matrix of the var-

ious data attributes based on the potential impact of

their conﬁdentiality breach.

Figure 3: Proposed categorization of EHR attributes.

As illustrated in Figure 3, all EHR data attributes are classified into 12 categories. Those categories offer a balanced view of their sensitivity and frequency of access and determine their storage locations and protection policies throughout the framework. The paper

aims to classify all attributes encountered in the sam-

ple database including personal information, demo-

graphics, consultation information, and triage data.

The above categories fall in three protection levels

from highest to lowest priority as follows:

• Extremely Protected: this includes least frequent

identiﬁers, frequent identiﬁers, and least frequent

quasi-identiﬁers.

• Protected: this includes most frequent identiﬁers,

frequent quasi-identiﬁers, most frequent quasi-

identiﬁers, and all limited data attributes.

• Unprotected: all non-identifying data attributes.
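The mapping from category to protection level listed above can be written down as a small lookup. The sketch below assumes that the 12 categories are the product of four attribute types (identifier, quasi-identifier, limited, non-identifying) and three frequency bands (most frequent, frequent, least frequent), which is inferred from the list rather than taken from Figure 3. The constant and function names are hypothetical.

```python
from itertools import product

# Assumed axes of the categorization matrix (4 types x 3 frequencies = 12).
IDENTIFIER, QUASI, LIMITED, NON_IDENTIFYING = (
    "identifier", "quasi-identifier", "limited", "non-identifying")
MOST, FREQUENT, LEAST = "most frequent", "frequent", "least frequent"

EXTREME, PROTECTED, UNPROTECTED = (
    "extremely protected", "protected", "unprotected")


def protection_level(kind: str, frequency: str) -> str:
    """Return the protection level for one attribute category."""
    if frequency not in (MOST, FREQUENT, LEAST):
        raise ValueError(f"unknown frequency: {frequency}")
    if kind == NON_IDENTIFYING:
        return UNPROTECTED
    if kind == LIMITED:
        return PROTECTED
    if kind == IDENTIFIER:
        # Only the most frequently used identifiers drop to "protected".
        return PROTECTED if frequency == MOST else EXTREME
    if kind == QUASI:
        # Only the least frequently used quasi-identifiers are raised.
        return EXTREME if frequency == LEAST else PROTECTED
    raise ValueError(f"unknown attribute type: {kind}")


matrix = {
    (kind, freq): protection_level(kind, freq)
    for kind, freq in product(
        (IDENTIFIER, QUASI, LIMITED, NON_IDENTIFYING), (MOST, FREQUENT, LEAST))
}
```

Under these assumptions the matrix contains 3 extremely protected, 6 protected, and 3 unprotected categories, matching the three bullet points.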

The level of protection governs the level of storage for the associated data attributes in the framework database. The higher the level of storage, the more iterations of foreign key decryption are needed to retrieve the data attribute, as illustrated in the following sections.

3.2 Designing the Hierarchical

Structure of the Framework and

Determination of Quasi-Identiﬁers

Masking Methods

This section aims to outline the conceptual model of

the proposed framework. It suggests the 7 rules gov-

erning the design principles based on the protection

level of the data attributes. Furthermore, the frame-

work considers the masking of some quasi-identiﬁers

such as dates while preserving the epidemiological

and clinical value.

The proposed framework assumes the preserva-

tion of the below rules:

3.2.1 Each Searchable Identifiable Attribute Must Be Stored in a Separate Database Table. This Eliminates the Possibility of Cross Referencing One Identifying Information Using Another

For searchable identifiable attributes, such as names and unique identifiers, the proposed framework requires the storage of no more than one in a single database table. This ensures that attackers do not gain access to one piece of identifying information because of their previous knowledge of another (e.g., obtaining a social security number using the patient's name).

The highest-level table has only one encrypted for-

eign key to the next lower-level table. Lower-level

tables would have an encrypted foreign key for the

next lower-level record and another encrypted foreign

key for the previous higher-level table. The decryp-

tion of foreign keys within the software algorithm is

necessary to enable querying the next table in either

direction. Policymakers can choose whether to hier-

archically position those tables in a chained format or

to undergo grouping of attributes in various levels to

improve performance.
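The chained layout described in this rule can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the table names (`names`, `ids`, `reference`), the column names (`enc_next`, `enc_prev`), and the sample attributes are hypothetical, the tables are in-memory dictionaries instead of SQL tables, and `toy_encrypt` is a toy HMAC-based stream cipher standing in for a vetted cipher such as AES-GCM.

```python
import hashlib
import hmac
import secrets
import uuid


def _keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Expand (key, nonce) into `length` pseudo-random bytes with HMAC-SHA256."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hmac.new(key, nonce + counter.to_bytes(4, "big"), hashlib.sha256).digest()
        counter += 1
    return out[:length]


def toy_encrypt(key: bytes, plaintext: str) -> bytes:
    """Toy cipher for illustration only; use a vetted cipher in practice."""
    nonce = secrets.token_bytes(16)
    data = plaintext.encode()
    stream = _keystream(key, nonce, len(data))
    return nonce + bytes(a ^ b for a, b in zip(data, stream))


def toy_decrypt(key: bytes, token: bytes) -> str:
    """Inverse of toy_encrypt: split off the nonce and XOR with the same keystream."""
    nonce, body = token[:16], token[16:]
    stream = _keystream(key, nonce, len(body))
    return bytes(a ^ b for a, b in zip(body, stream)).decode()


# One independent key per encrypted foreign key column (hypothetical names).
KEYS = {
    column: secrets.token_bytes(32)
    for column in ("names.enc_next", "ids.enc_prev", "ids.enc_next", "reference.enc_prev")
}

# In-memory stand-ins for three tables, keyed by random (non-sequential) primary keys.
names = {}      # highest level: patient name, one encrypted FK downward
ids = {}        # middle level: unique identifier, encrypted FKs in both directions
reference = {}  # reference table: non-identifying attributes, encrypted FK upward


def register(name: str, national_id: str, blood_group: str) -> str:
    """Insert one patient across the three levels and return the reference-table PK."""
    name_pk, id_pk, ref_pk = (str(uuid.uuid4()) for _ in range(3))
    names[name_pk] = {"name": name, "enc_next": toy_encrypt(KEYS["names.enc_next"], id_pk)}
    ids[id_pk] = {
        "national_id": national_id,
        "enc_prev": toy_encrypt(KEYS["ids.enc_prev"], name_pk),
        "enc_next": toy_encrypt(KEYS["ids.enc_next"], ref_pk),
    }
    reference[ref_pk] = {
        "blood_group": blood_group,
        "enc_prev": toy_encrypt(KEYS["reference.enc_prev"], id_pk),
    }
    return ref_pk


def reference_row_for_name(name: str):
    """Walk downward: names -> ids -> reference, decrypting one FK per hop."""
    for row in names.values():
        if row["name"] == name:
            id_pk = toy_decrypt(KEYS["names.enc_next"], row["enc_next"])
            ref_pk = toy_decrypt(KEYS["ids.enc_next"], ids[id_pk]["enc_next"])
            return reference[ref_pk]
    return None


def name_for_reference(ref_pk: str) -> str:
    """Walk upward: reference -> ids -> names, decrypting one FK per hop."""
    id_pk = toy_decrypt(KEYS["reference.enc_prev"], reference[ref_pk]["enc_prev"])
    name_pk = toy_decrypt(KEYS["ids.enc_prev"], ids[id_pk]["enc_prev"])
    return names[name_pk]["name"]
```

Each hop costs one decryption in the application layer, which is why a longer chain is slower than grouping attributes at one level. Someone reading the stored rows directly sees only opaque bytes in `enc_next` and `enc_prev`.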

3.2.2 Non-Searchable Identiﬁable Attributes

Must Be Stored Encrypted in a Separate

Table

The proposed framework suggests that all non-searchable identifiable attributes be stored in an encrypted format within a separate database table. No database table should contain columns of both searchable and non-searchable data attributes at any time. Developers and security policymakers can decide whether one database table contains all non-searchable identifiable attributes or whether each attribute, or group of related attributes, is segregated in a different table. Separating attributes into different tables makes the proposed framework more resistant to attacks and further dilutes the potential harm of a database attack. One important caveat is that the more complex the adopted model, the lower the anticipated performance.

3.2.3 Non-Identiﬁable Searchable and

Non-Searchable Attributes Should Be

Stored in the Reference Database Table

This framework suggests that non-identifiable attributes be stored following common database structuring practice. Encryption is not necessary for these attributes. However, the table contains the encrypted foreign key to the lowest-level identifying table in the framework. This table is called the reference table in this work. Its primary key can be referenced by foreign keys in the various clinical encounter tables using standard relational database features. Storing these attributes unencrypted keeps the impact on performance to a minimum when aggregate functions are used to analyze de-identified health and administrative information. Re-identification thus remains possible through the encrypted foreign key, alongside the regular storage of non-encrypted attributes in the reference table. The proposed framework leaves database structuring and normalization decisions to system analysts and database administrators.

When migrating to this model, analysts and database administrators need to make only minimal changes to the database tables below the reference table, to address quasi-identifiers such as clinical encounter dates and caregiver information. Patient identification tables are replaced by the reference table, and database integrity should not be affected, assuming that the database has a reasonably normalized design. This is because patients and personnel are identified in only one table, and health information and encounters refer to the identification tables through a foreign key.

Developers and security policymakers may

choose to enhance security further and reduce the

impact of database attacks by encrypting non-

identifying attributes. However, this could have a

signiﬁcant performance impact on reporting and

application data aggregation power. For this paper,

such extra encryption is out of scope and could be

revisited in future work.

3.2.4 Health Encounters and Administrative

Records Can Only Be Referenced Through

Reference Tables Producing Limited

Datasets

This approach requires that health encounters, administrative records, or any record protected under the HIPAA privacy rule or the security policy of the organization or region contain a foreign key, when needed, to the reference table containing non-identifiable information. At this point, all identifiers and quasi-identifiers, except the dates of encounters and administrative records, are stored in a higher-level table above the reference table. Querying these attributes requires the application backend to decrypt one or more encrypted foreign keys. Consequently, the resting state of the database instance is de-identified, and neither a direct SQL injection attack nor any other query executed on the database server can retrieve electronic personal health information. Such a query could, however, retrieve dated electronic health information, which constitutes a limited dataset. This is addressed in the next rule.

3.2.5 Dates Must Be Masked Through a

Random Increment or Decrement that Is

Different for Each Patient

To achieve a truly de-identified state, dates should be masked, as they are considered quasi-identifiers and could be exploited by attackers with prior knowledge of encounter dates. The framework proposes a random increment or decrement, which it treats as a non-identifying, non-searchable attribute. As discussed in design principle 3, this interval is stored in a separate database table or with other non-searchable identifiable attributes. Keeping the interval static per patient preserves the clinical case timeline: physicians can still observe the progression of clinical cases on realistic timelines. One consideration when developing the interval determination algorithm is that seasonality and endemic information, which are of high clinical significance, could be disrupted. To overcome this, policymakers can allow caregivers to reproduce limited datasets or re-identify information while adhering to the health information security and privacy rules that apply to their organizations. For example, the HIPAA privacy rule requires logging any access to limited or identified datasets.
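A per-patient date offset of this kind can be sketched as follows. This is a minimal illustration under assumptions: the bound of 365 days and the function names are hypothetical, since the paper does not specify the range of the interval or how it is drawn.

```python
import secrets
from datetime import date, timedelta

MAX_SHIFT_DAYS = 365  # assumed bound; the paper does not specify a range


def new_patient_offset(max_days: int = MAX_SHIFT_DAYS) -> int:
    """Draw a non-zero offset in [-max_days, max_days] once, at registration."""
    offset = 0
    while offset == 0:
        offset = secrets.randbelow(2 * max_days + 1) - max_days
    return offset


def mask_date(real: date, offset_days: int) -> date:
    """Date as stored at rest: the real date shifted by the patient's offset."""
    return real + timedelta(days=offset_days)


def unmask_date(stored: date, offset_days: int) -> date:
    """Recover the real date in the application layer, given the offset."""
    return stored - timedelta(days=offset_days)
```

Because every date of a given patient is shifted by the same amount, the gap between any two encounters is unchanged, which is what keeps the clinical timeline readable. The calendar position of each encounter does move, which is the seasonality concern raised above.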

3.2.6 Incremental Primary Keys Are Not Allowed for Any Table Above the Reference Table

With incremental primary keys, attackers can infer the relation between two tables through the key sequence. For example, an attacker might deduce that entry number 1 in the patient names table corresponds to entry number 1 in the social security numbers table. To ensure maximal security and to prevent attackers from inferring personal health information from the order of insertion or from creation timestamps, the proposed framework does not allow incremental primary keys in any database table above the reference table. As an example, developers can use universally unique identifiers (UUIDs) for tables without a unique attribute.
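The difference between the two key styles can be illustrated with a short Python sketch. The table contents are made-up sample values, and the choice of version-4 (random) UUIDs is an assumption of this sketch; it is used here because version-1 UUIDs embed a timestamp, which the rule aims to avoid exposing.

```python
import uuid


def build_sequential(values):
    """Auto-increment style keys: 1, 2, 3, ... in insertion order."""
    return {i: v for i, v in enumerate(values, start=1)}


def build_with_uuids(values):
    """Random version-4 UUID keys carry no insertion order or timestamp."""
    return {str(uuid.uuid4()): v for v in values}


patient_names = ["Alice Example", "Bob Example"]  # made-up sample rows
patient_ssns = ["000-00-0001", "000-00-0002"]

# Sequential keys: joining the two tables on the key re-links name and SSN.
seq_names = build_sequential(patient_names)
seq_ssns = build_sequential(patient_ssns)
relinked = {seq_names[k]: seq_ssns[k] for k in seq_names}

# UUID keys: the two tables share no key values, so the same join yields nothing.
uuid_names = build_with_uuids(patient_names)
uuid_ssns = build_with_uuids(patient_ssns)
shared_keys = set(uuid_names) & set(uuid_ssns)
```

With sequential keys, `relinked` pairs every name with its social security number without any decryption. With random keys, the only remaining link between the tables is the encrypted foreign key.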

3.2.7 Encryption Keys Should Be Different for

Every Encrypted Foreign Key

Developers and policymakers achieve a higher level of protection when a different encryption key is used for each foreign key. This ensures that if any foreign key is compromised through the software interfaces, the application tier, or improper key management, the other foreign keys remain protected and the data leak is kept to a minimum. Key management methods and procedures are out of scope for this work but could be examined in future work.

Figure 4: Proposed schema of EHR data attributes.

Adopting the schema proposed in Fig. 4 by implementing the 7 rules proposed in this section results in a loss of referential integrity between the various tables above the reference table. This is caused by the encrypted foreign keys, which are handled at the application layer. This loss of referential integrity was observed in all methodologies that segregate identifiers and quasi-identifiers from health encounters. However, unlike the other approaches examined in this work, this framework sustains a one-to-one relationship between all tables above the reference table, with no change to referential integrity with the clinical encounters. This simplifies implementing integrity constraints at the application layer to compensate for the anticipated loss of referential integrity.

3.3 Applying the Framework to an

EHR Database

This section applies the framework to a conceptual EHR database. A typical SQL database schema for an electronic health records system is used for the adaptation. The database comprises 70 tables that cover primary, secondary, and tertiary health care services. For this study, we de-identify only the tables that store primary health records, administration records, and patient identification data, as an example that can be extended to wider use.

Figure 5: Original database structure.

Figure 5 shows the original structure of the database. Three database tables are de-identified using the structure proposed by the framework:

• Patients Information Table: This table contains

identiﬁers and quasi-identiﬁers of all patients pro-

cessed through the software.

• Triage Information Table: This table contains information from various nursing stations in primary healthcare settings. It stores the patient's vital signs, chief complaints, and a few administrative attributes, along with the dates on which services were received.

• Clinical Consultations Table: This table contains

the consultation information of all cases. It stores

all consultation details as well as the diagnosis

and the treatment plan of the patient, which are

the most sensitive information to be protected.

To restructure the schema presented in Figure 5 in accordance with the seven design principles outlined previously in this work, the data attributes need to be categorized based on their frequency of use and identifiability. This categorization helps determine the correct level at which to store each attribute in the new schema. We arrived at the categorization of patient personal information attributes shown in Figure 6 based on the functionality of typical electronic health records systems.

Figure 6: Attributes categorization using the identifiability continuum.

Given the encryption of foreign keys in the application database, it is anticipated that the database loses referential integrity between the higher-level tables. Software developers can implement various constraints at the application layer to ensure referential integrity between those tables.

3.3.1 Step 1: Searchable Identiﬁable Attributes

Must be Stored in a Separate Database

Table

To comply with the first rule, identifiable information that is searchable (non-encrypted) has to be stored in separate tables to preserve confidentiality. This ensures that an attacker who gains access to the database cannot infer or correlate identifiers based on prior knowledge of one or more of them. An example would be an attacker who knows a specific patient's name being able to retrieve that patient's social security number because both are stored in one row. Both the patient's full name and the patient unique identifier, highlighted in Figure 6, are considered searchable (frequent) identifiers. To abide by the framework rules, they are stored in separate database tables, each with an encrypted foreign key relating to the next-level table. Developers can choose to structure those tables hierarchically as separate levels of the framework or to store the various attributes at the same level. If the same level is chosen, developers must use different encryption keys or vectors for the two tables. This ensures that the encrypted foreign key value is not the same in both tables and cannot be used for cross-referencing.

Figure 7 illustrates the implementation of the framework for searchable identifiable attributes with each attribute stored at a separate level. This is a more secure design than the one in Figure 8. However, it requires back-end functions to iterate through successive levels to reach the reference table, which is used to relationally reference the clinical encounters, and this results in performance degradation. This work uses this design in the database transformation, so all results correspond to the most secure implementation of the framework. Developers and organizations can weigh the security benefits against the performance impact to determine whether to place their searchable identifiable attributes at one or multiple levels.

Figure 7: Searchable identifiable attributes must be stored in a separate database table for each attribute, separate levels design (more secure).

Figure 8: Searchable identifiable attributes must be stored in a separate database table for each attribute, same level design (less secure).

To ensure that users with access to the database are unable to use relational features to infer ePHI or any other identifiable information, the encryption key or encryption vector should be different for the Patient Ids, Patient Names, and Patients tables. For the seventh rule of the framework to be properly adopted, the following conditions must hold:

• In Figure 7: Patient Ids.encryptedNextPK ≠ Patients.encryptedPrevPK.

• In Figure 8: Patient Ids.encryptedNextPK ≠ Patient Names.encryptedNextPK.
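The reason for these inequalities can be shown with a small Python sketch. It assumes that the two foreign key columns being compared reference rows of the same table, so they hold the same plaintext primary keys. The cipher is a deterministic toy construction (an HMAC-SHA256 keystream with a fixed IV), chosen only so that the comparison is reproducible; it is not a recommendation, and the primary key values are made up.

```python
import hashlib
import hmac
import secrets


def toy_encrypt(key: bytes, iv: bytes, plaintext: str) -> bytes:
    """Deterministic toy cipher: same key, IV and plaintext give the same output."""
    data = plaintext.encode()
    stream = b""
    counter = 0
    while len(stream) < len(data):
        stream += hmac.new(key, iv + counter.to_bytes(4, "big"), hashlib.sha256).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))


def linkable(column_a, column_b) -> bool:
    """True if any ciphertext appears in both columns, so a plain join would match."""
    return bool(set(column_a) & set(column_b))


# Primary keys of three rows in the table both columns point to (made-up values).
target_pks = ["pk-7f3a", "pk-c219", "pk-05be"]
iv = bytes(16)  # fixed IV, so only the key decides the ciphertext in this demo

shared_key = secrets.token_bytes(32)
key_a, key_b = secrets.token_bytes(32), secrets.token_bytes(32)

# Violation: both foreign key columns encrypted under one key and one IV.
same_key_next = [toy_encrypt(shared_key, iv, pk) for pk in target_pks]
same_key_prev = [toy_encrypt(shared_key, iv, pk) for pk in target_pks]

# Compliant: each foreign key column has its own key.
own_key_next = [toy_encrypt(key_a, iv, pk) for pk in target_pks]
own_key_prev = [toy_encrypt(key_b, iv, pk) for pk in target_pks]
```

In the violating case, equal ciphertexts let anyone with database access join the two tables without holding a key. With separate keys, or with a different vector per column, the stored values no longer match and the join returns nothing.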

3.3.2 Step 2: Non-Searchable Identiﬁable

Attributes Must be Stored Encrypted in a

Separate Table

To abide by this rule, all non-searchable identifiers and quasi-identifiers must be removed from the reference table and preferably stored in an encrypted format. As discussed previously, policymakers can choose whether to store identifiers and quasi-identifiers at the same level, and can also group those attributes according to their relevance and the possibility of re-identification through cross-matching. For example, if the insurance identification number is stored in the same table as the name or ID of the insurance provider, the possibility of inference increases if attackers gain access to the information in that table. Policymakers can choose to store those two attributes at separate levels of the framework to increase security, keeping in mind the potential impact on performance. Both implementations are illustrated in Figures 9 and 10.

Figure 9: Non-searchable identiﬁable attributes must be

stored encrypted in a separate table. separate levels design

(more secure).

To prevent violation of the framework rules, the encryption key or encryption vector should be different for the Patient Ids, Patient Names, Patients RIds, Patients QIds, and Reference table tables. As a result, the following conditions must hold:

• Patient Ids.encryptedNextPK ≠ Patients RIds.encryptedPrevPK.

• Patient Names.encryptedNextPK ≠ Patients QIds.encryptedPrevPK.

• Patient RIds.encryptedNextPK ≠ Reference table.encryptedPrevPK.

Although Figure 10 illustrates a higher-

performing design, it is less secure than the model

illustrated in Figure 9.

3.3.3 Step 3: Non-Identiﬁable Searchable and

Non-Searchable Attributes Should be

Stored in the Reference Database Table

As discussed in the previous section, successful implementation of rules 1, 2, and 7 of the framework results in only non-identifiable information being stored in the reference table. This allows EHR solutions to leverage relational database features to process and manipulate data stored in the reference table without any decryption by the backend application or any change to the application's logic. Because the framework preserves non-identifiable information in the reference table, researchers retain the ability to leverage EHR data for public health, disease surveillance, and other research purposes. More complex research studies, such as time series and case studies, would require specific algorithms to be implemented in the back-end application.

Figure 10: Non-searchable identifiable attributes must be stored encrypted in a separate table. Single table design (less secure).

3.3.4 Step 4: Dates Must be Masked Through a

Random Increment or Decrement that is

Different for Each Patient

To achieve a higher level of security for databases in their resting state, dates and timestamps are masked, as they are considered quasi-identifiers. Attackers can infer patients' identifiable information if they have pre-existing knowledge of the timestamps of clinical encounters. To prevent this, the model proposes a random increment or decrement that is automatically assigned to each patient at the registration stage. This factor is used to mask the dates in all records associated with the patient, in all tables, regardless of their location within the framework. Because the random factor is static, data integrity and quality remain intact for researchers performing case progression and time series analysis. However, seasonality might be lost; this could be addressed by using a less secure year-based increment/decrement factor. Given that dates are considered quasi-identifiers in health applications, the factor is stored at the same level as other quasi-identifiers.
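The year-based variant mentioned above can be sketched as follows. This is one possible reading, not a specification from the paper: the offset is a whole number of years, the bound of 5 years and the function names are assumptions, and 29 February is mapped to 28 February when the target year is not a leap year.

```python
import secrets
from datetime import date


def new_year_offset(max_years: int = 5) -> int:
    """Draw a non-zero whole-year offset in [-max_years, max_years] at registration."""
    offset = 0
    while offset == 0:
        offset = secrets.randbelow(2 * max_years + 1) - max_years
    return offset


def shift_years(d: date, years: int) -> date:
    """Shift a date by whole years, keeping month and day so the season is preserved."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        # 29 February has no counterpart in a non-leap target year.
        return d.replace(year=d.year + years, day=28)
```

Month and day survive the shift, so seasonal patterns remain visible in the masked data. The price is a much smaller set of possible offsets than a day-level shift offers, which is consistent with the text describing this option as less secure. Dates that fall on 29 February are not recovered exactly by the reverse shift in this sketch.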

Figure 11 illustrates the final proposed database structure implementing all rules of the proposed framework. The main patient's table illustrated in Figure 5 was split into 5 smaller tables with encrypted foreign keys referring to the next- and previous-level tables. The framework preserved the original patients' identifiers in all encounters tables and did not change the structure of those tables. This enables a simpler adoption of the framework, because only minimal changes to the software logic are needed.

Figure 11: Final proposed database structure, abiding by all framework rules.

The final proposed database structure stores searchable identifiable attributes in separate tables structured in a hierarchical manner (Rule 1). It also stores less frequently used identifiers at a different level from quasi-identifiers (Rule 2). All non-identifying attributes are stored in the reference table for easy retrieval (Rule 3). All encounters' tables refer to the reference table (Rule 4), which preserves the referential features of relational databases unchanged and reduces the impact on the use of the database for data aggregation and research. The structure also stores the date increment/decrement factor separately for each patient within the database (Rule 5). At all levels above the reference table, it uses universally unique identifiers (UUIDs) instead of incremental primary keys (Rule 6), which limits the possibility of inference from the ordering of related records across tables. When implementing this structure, encryption keys or vectors are different for different levels, ensuring that the relational features of the database cannot be used to re-identify data (Rule 7).

Implementing this approach in pre-existing EHR systems would require changes to the tables containing patient identifiers, with minimal to no changes to the clinical encounters tables. For applications that follow best practices such as database normalization and modular application design, the implementation would be relatively simple and very similar to the approach presented in this paper. Efficient use of this approach relies heavily on adopting an application design that minimizes re-identification. Example approaches are to temporarily store de-identified patient information in the session while providers navigate the relevant patient history, or to use object-oriented architectures in which re-identification happens once for the active patient's object.
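The object-oriented option can be sketched as a small Python class. This is an illustration under assumptions: `ActivePatient`, `resolve`, and `fake_resolve` are hypothetical names, and `fake_resolve` merely stands in for the back-end routine that would decrypt the foreign key chain.

```python
from typing import Callable, Dict, List, Optional


class ActivePatient:
    """Holds one patient's reference key and re-identifies lazily, at most once."""

    def __init__(self, reference_pk: str, resolve: Callable[[str], Dict[str, str]]):
        self.reference_pk = reference_pk
        self._resolve = resolve  # hypothetical back-end function that walks the FK chain
        self._identity: Optional[Dict[str, str]] = None

    @property
    def identity(self) -> Dict[str, str]:
        if self._identity is None:  # first access triggers the costly chain walk
            self._identity = self._resolve(self.reference_pk)
        return self._identity

    def close(self) -> None:
        """Drop the cached identifiers when the provider leaves the patient's record."""
        self._identity = None


calls: List[str] = []


def fake_resolve(reference_pk: str) -> Dict[str, str]:
    """Stand-in for the real chain traversal; records each invocation."""
    calls.append(reference_pk)
    return {"name": "Jane Doe"}


patient = ActivePatient("ref-001", fake_resolve)
first = patient.identity["name"]
second = patient.identity["name"]  # served from the cached value
```

However many screens read `patient.identity` during a visit, the chain of decryptions runs once. Calling `close()` discards the identifiers so that they do not outlive the session.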

4 CONCLUSION AND FUTURE

WORK

De-identification of medical data has been a widely used solution to produce data extracts for research and analysis. This work identified 7 framework principles that could de-identify any health information system database by segregating identifiable information into separate database tables based on their importance and frequency, preventing the use of relational database features between those tables through encryption of foreign keys, and addressing quasi-identifiers such as encounter dates through masking with a random increment/decrement that is stored in the same manner. The de-identification of a sample EHR database schema was successful, migrating the original database structure to a structure that conforms to the 7 principles of the framework.

In future work, we aim to test the framework on a real-life EHR database and compare its performance against the original to determine the efficiency of the suggested framework.
