StorageBIT

A Metadata-aware, Extensible, Semantic and Hierarchical Database for Biosignals

Carlos Carreiras

, Hugo Silva

, Andr

e Lourenc¸o

1,2

and Ana Fred

Instituto de Telecomunicac¸

oes, IST-UTL, Lisbon, Portugal

Instituto Superior de Engenharia de Lisboa, ISEL, IPL, Lisbon, Portugal

Keywords:

Biosignal, Database, Metadata, File Format, Semantics.

Abstract:

Acquisition of biomedical data is, nowadays, widespread, originating a deluge of data that may contain rele-

vant and interesting information for health-care professionals, biosignal researchers, and the individuals them-

selves. This creates the need to organize the information in a structured way, facilitating collaboration and

research efforts. Therefore, for that purpose, this paper investigates database systems and ﬁle formats, dis-

cussing current technologies, requirements and possible implementations. These implementations were put

through a benchmarking package to analyze their insertion, query and update performance. A ﬁnal approach

combining the use of HDF5, a hierarchical ﬁle format for numerical data, and MongoDB, a NoSQL database,

is proposed, as it showed the best combination of properties from the tested solutions.

1 INTRODUCTION

The ability to measure biomedical signals (biosignals)

is no longer restricted to specialized environments

and expensive equipment. Indeed, recent advances

in wireless protocols, ubiquitous computing, and sen-

sor technology have made the acquisition of biosig-

nals widespread. Companies like Nike

, Zeo

, and

Fitbit

all currently commercialize products to track

various types of information, such as the number of

steps, calories, sleep, weight, heart rate, etc.. This

technological revolution has given way to a new con-

cept of healthy living, the so-called Quantiﬁed Self

movement, whereby objective, quantiﬁed information

can lead to insights that prompt behavior change, re-

sulting in improved health (Kvedar, 2011).

This opens access to vast amounts of data that,

ﬁrst and foremost, need to be stored. Only then

is it possible to apply signal processing algorithms

(Vaseghi, 2006), pattern recognition, multi-modal

data fusion and classiﬁcation techniques (Jain et al.,

2000) to extract meaningful information from the ac-

quired signals. This work is thus focused on the prob-

lem of data storage and access in the context of biosig-

nals, exploring the requirements, available technolo-

gies and efﬁciency aspects of choosing a database sys-

www.nike.com/fuelband

www.myzeo.com/sleep

www.ﬁtbit.com

tem and developing its interface. This is done in two

perspectives. First, the format on which the data is ac-

tually stored must be speciﬁed and implemented, and

second, the database itself needs to be deployed.

The remainder of this paper is organized as fol-

lows. We start with a review of the state-of-the-

art, pointing out limitations of current approaches.

Then, the basic requirements are presented, dis-

cussing available technologies for the storage of

biosignals. Subsequently, various implementation al-

ternatives are analyzed, and then evaluated with a set

of performance tests. Finally, the main conclusions

are outlined, specifying the chosen implementation

and future work.

2 STATE-OF-THE-ART

Databases for biosignals are essential for scientiﬁc

development, research in biomedical signal process-

ing and in health-care applications. However, exist-

ing attempts to standardize a ﬁle format for the stor-

age and sharing of biosignals have several limitations,

which hamper research efforts. Indeed, standardiza-

tion allows for easier collaboration, which improves

the quality of developed software, exchange of sci-

entiﬁc results and health-care practices, while also

reducing costs, and leading to the discovery of new

knowledge (Varri et al., 2001). In reality, many ﬁle

Carreiras C., Silva H., Lourenço A. and Fred A..

StorageBIT - A Metadata-aware, Extensible, Semantic and Hierarchical Database for Biosignals.

DOI: 10.5220/0004241400650074

In Proceedings of the International Conference on Health Informatics (HEALTHINF-2013), pages 65-74

ISBN: 978-989-8565-37-2

 2013 SCITEPRESS (Science and Technology Publications, Lda.)

formats exist, with well over 100 documented for-

mats (Brooks et al., 2011), such as the Extensible

Biosignal File Format (EBS) (Hellmann et al., 1996),

the European Data Format (EDF+) (Kemp and Oli-

van, 2003), the Medical Waveform Format Encod-

ing Rules (MFER) (MFER, 2003), and the WaveForm

DataBase (WFDB) (Goldberger et al., 2000). Many

formats are proprietary, limiting the access to the in-

formation contained in the ﬁles. Most are designed

for a speciﬁc purpose, such as the storage of electro-

cardiogram (ECG) signals, and there is a general lack

of structured metadata, that is, all the semantic infor-

mation that describes how the signals were acquired

and what was done with them.

It should be noted that peer-reviewed literature on

this topic is fairly limited. A quick search for ”biosig-

nal database” in IEEE’s Xplore



Digital Library pro-

duces 28 results. Of these, only four directly address

the issues of specifying and implementing a system

for the storage of biosignals. For instance, Penzel et

al. (Penzel et al., 2001) describe the approach used

to store polysomnography (PSG) data, produced in

the scope of SIESTA, an European project carried out

in 2001. The European Data Format was used, with

the need to specify strict ﬁlename conventions (e.g.

encoding the subject identiﬁcation, or the recording

site), and the internal structure of the ﬁles (i.e. the or-

der of the signals in the ﬁle and their sampling rates).

In the same year, Lovell et al. (Lovell et al., 2001) de-

tailed a framework for web-enabled storage of biosig-

nals, where it is already noted the absence of a com-

mon ﬁle format. The development of browser-based

applications is discussed and access-load issues are

addressed.

More recently, the focus has been given to se-

mantic approaches (see, for example, (Brooks et al.,

2011), (Kokkinaki et al., 2008), and (Brooks, 2009)),

where, in addition to the core biosignal data, contex-

tual information is included, such as patient and ac-

quisition procedure information, enabling the integra-

tion of disparate and heterogeneous sources of med-

ical information and facilitating their query and re-

trieval.

It is easy to grasp that current methods exhibit

some glaring limitations. In particular, it is common

to have the signal data separated from metadata char-

acterizing the experimental setup. Therefore, possess-

ing the data ﬁles alone is of limited interest, because

the acquisition context necessary to analyze them can-

not be accessed. Furthermore, current ﬁle formats

usually have a ﬁxed number of metadata ﬁelds, with

limited size, employing weak semantics (while cer-

tain metadata ﬁelds are self-explanatory – sampling

frequency, resolution, etc. – others may not be – e.g.

”channel” may refer to the actual samples of a sig-

nal, its label, or the identiﬁcation of the physical in-

put of the acquisition system). Additionally, current

approaches make it hard to append new information

to an already existing ﬁle (e.g. a ﬁltered version of

the biosignals, or a comment about the data). And,

ﬁnally, these approaches provide poor interfaces to

the user, with very limited or inexistent query sup-

port. These limitations provide our motivation to

build upon the current state-of-the-art and develop an

extensible, semantic and hierarchical infrastructure.

3 STORING BIOSIGNALS

In response to the limitations found in previous work,

a list was curated with the most important aspects

and properties a system that stores biosignals should

exhibit. Based on these requirements, a data model

was speciﬁed and various implementations of the data

model were investigated.

3.1 Requirements

The following properties were used to evaluate and

compare the various ﬁle formats under study: 1) Ac-

cess Performance: Read and write speed, non-

sequential access, data compression, etc.; 2) Cross-

Platform Support: The availability of tools for

different operating systems and programming lan-

guages; 3) Events Support: Events and annotations

are textual comments or values related to a particular

signal, or to the acquisition session as a whole (e.g.

the location of R waves in the ECG). This type of data

is very important, given that only through the evalua-

tion of annotations a human user or a computer algo-

rithm can learn the meaning of speciﬁc signal patterns

(Penzel et al., 2001); 4) Extensibility: The ability to

easily add more data to a ﬁle; 5) Metadata: Deﬁned

as data about data, pertains to all the additional infor-

mation that characterizes the acquired signals. It in-

cludes general and particular attributes of the biosig-

nals, when, how and by whom the acquisitions were

made, their purpose, and what processing has been

applied. Fields for metadata should be extensible (al-

lowing to add more information along the way) and,

more importantly, should have meaning, this is, the

use of a controlled vocabulary to specify content, e.g

using ontologies (McGuinness, 2003). Some com-

mon metadata ﬁelds can be seen in Figure 1. The

use of metadata allows for knowledge to be processed

computationally in a comparable way to numeric data

(Brooks, 2009); 6) Multi-modality: The capability to

store various signal types in a single container struc-

HEALTHINF2013-InternationalConferenceonHealthInformatics

Table 1: Comparison of ﬁle formats and NoSQL DBMSs.

(a) Comparison of ﬁle formats.

Property

File Format

EBS EDF+ MFER WFDB HDF5

Compression - - - - +++

Non-Sequential Access - - - - +++

Cross-Platform Support + ++ ? ++ ++

Events Support ++ + - ++ ++

Extensibility - - - - +++

Metadata ++ + + + +++

Multi-modality - + - + +++

Querying - - - - -

(b) Symbols used in Table (a).

Symbol Description

? Information Not Available

- Not Supported

+ Supported

++ Good Support

+++ Excellent Support

Property

NoSQL DBMS

CouchDB MongoDB

Flag Line

DB consistency, ease of use

Scalable, high-performance, retains

friendly properties of SQL

Interface

HTTP/RESTful BSON

Queries

MapReduce Javascript expressions

Replication

Master-Master Master-Slave, replica sets

Others

Multiversion Concurrency Control,

attachment handling

Memory-mapped ﬁles, sharding,

embedded ﬁle system (GridFS)

Rec o rd = {

’ Header ’: { ’ Sess i o n ’: ’#’ ,

’ Date ’: ’ yyyy -mm - d d Thh : mm : ss ’ ,

’ Sup e r viso r ’: ’ txt ’ ,

’ Exp e r imen t ’: { ’ Name ’: ’ txt ’ ,

’ Des c rip t ion ’: ’ txt ’,

’ G oals ’: ’ txt ’} ,

’ Subject ’: { ’ Name ’: ’ txt ’ ,

’ Birthda y ’: ’ yyyy - mm - dd ’,

’ Ge n d e r ’: ’txt ’ ,

’ Contact ’: ’ txt ’}} ,

’ Tags ’: [ ’ txt ’, ...] ,

’ A udit ’: {} ,

’ Bios i g nal ’: { ’ Type ’: ’ txt ’,

’ De v i c e ’: ’txt ’ ,

’ Sa m ple Rate ’: ’#’ ,

’ Duratio n ’: ’# ’ ,

’ Res o l utio n ’: ’# ’ ,

’ U nits ’: { ’ Time ’: ’ txt ’ , ’ Sensor ’:

’txt ’}} ,

’ Ev e n t s ’: { ’ Source ’: ’ txt ’ ,

’ Type ’: ’ txt ’ ,

’ Dic t i onar y ’: { ’ key ’: ’ value ’}} }

Figure 1: Hierarchical structure of a Record, in a JSON-style notation, as speciﬁed by the Data Model, with some common

metadata ﬁelds.

ture. These signals may be of different nature (e.g.

ECG, EEG, PSG, etc.), and possibly acquired at dif-

ferent sampling rates, or even be mere events, thus

requiring appropriate accompanying information and

data structures; and 7) Querying: A query is a precise

request for retrieval of information that shares com-

mon properties within a database or an information

system. In particular, we are interested in the inclu-

sion of a system of tags, generated by biosignal re-

searchers and experts, leveraging the potential power

of folksonomies (Wal, 2007). In Table 1(a) the EBS,

EDF+, MFER and WFDB ﬁle formats are compared

based on these criteria.

Additionally to these requirements, the three fol-

lowing general aspects should also be taken into con-

sideration: 1) Accessibility: Data must be accessi-

ble to multiple people, possibly in multiple locations.

Therefore, the information must live on the cloud,

taking advantage of current web-based technologies;

2) Scalability: The ability of a system to respond to a

growing amount of work. This can be done in a verti-

cal sense (by adding more resources to a single node

in a system) or in a horizontal sense (by adding more

nodes to a system). Additionally, the intrinsic soft-

ware elements of the system should also be as efﬁcient

as possible. In the context of biosignals, the bottle-

neck lies, predominantly, in storage constraints. For

instance, a night-long acquisition of PSG data easily

exceeds the 1GB threshold. Thus, the system must

deal with large data volumes, and must process it (i.e.

put and retrieve operations) quickly; and 3) Flexibil-

ity: The system must sort through records from dif-

ferent experiments and through data of various types.

Therefore, it needs to be ﬂexible enough to accommo-

date these different needs, and to be able to cope with

changing requirements.

StorageBIT-AMetadata-aware,Extensible,SemanticandHierarchicalDatabaseforBiosignals

Start Date Start Time

Duration

Version

Reserved

Number

of Bytes

Number of

Data Records

Recording ID

Patient ID

( NS xLabel

Transducer

Physical Dimension

Physical Maximum

Digital Maximum

Digital Minimum

Preltering

Number of samples

Reserved

( NS x

( NS x )

)

( NS x

Physical Minimum

( NS x )

HEADER

(a) EDF+ header structure, with ﬁxed-length ﬁelds on the left and variable-length ﬁelds on the right, where each sharp-edged square represents 8 ascii

characters (64 bytes).

Signal 1 )

( Samples x

. . .

Signal NS )

( Samples x

DATA RECORD

(b) EDF+ data record structure, storing multiple (NS) biosignals; each round-edged square represents one byte.

Figure 2: Schematic representation of the EDF+ ﬁle structure; NS is the number of signals stored in the ﬁle.

3.2 Data Model

The basic abstract entities for storage are Records.

Records are equivalent to acquisition sessions, in the

sense that all the signals from one acquisition session

are stored in one Record. The signals should be con-

textualized with the experiment (i.e. the purpose of

the acquisition), the details of the subject, and com-

plemented with any relevant metadata, such as a de-

scription of the processing steps necessary to obtain

the signal (to audit the processing method), the phys-

ical units of the signals, the labels of the channels, or

annotations made by the acquisition’s supervisor. In

this way, multiple users could access the same data,

process it, and share any obtained results in a way

that is understandable by everyone, in a collaborative

effort.

With these principles in mind, an integrated

database management system for biosignals was de-

signed, in which the acquired records are sorted

by subject and, subsequently, by experiment. The

Records are composed by four main ﬁelds (see Fig-

ure 1): 1) Header: Identiﬁes the ﬁle and contains

basic information about the stored biosignals, spec-

ifying date and time of acquisition, who performed

the acquisition to whom, the experimental context,

and any automatically or manually generated index-

ing tags; 2) Audit: This ﬁeld is intended to list a de-

tailed history of the ﬁle, i.e., a systematic review of

what was done to the biosignals. It would summa-

rize, for instance, which ﬁlters were applied, together

with the corresponding parameters, or the processing

steps used; 3) Biosignals: The signals of interest, i.e.,

synchronous data; 4) Events: Any events or annota-

tions made asynchronously from the recorded biosig-

nal time series.

It should be noted that no element of the data

model has a predeﬁned or expected volume of infor-

mation, being as big as necessary and extensible. Fur-

thermore, the model is designed to be open. Elements

can be added or removed, according to the use case.

This makes it possible to adapt the model to any type

of ﬁeld structure, and, therefore, current ﬁle formats

can be easily mirrored in the Data Model.

As an example of how the Data Model compares

to one of the most widely used biosignal ﬁle formats,

the internal structure of the EDF+ format is shown in

Figure 2. An EDF+ data ﬁle is composed by a header

followed by data records. The header identiﬁes the

subject and describes the technical parameters of the

recording, while the data records contain the time se-

ries data. Although the header has variable-length, it

HEALTHINF2013-InternationalConferenceonHealthInformatics

is only to support the multiple data records stored in

the ﬁle, as the ﬁelds for each data record themselves

are dimensionally limited. Furthermore, the format

does not allow to store additional, user-deﬁned meta-

data ﬁelds. Figure 3 shows how the proposed Data

Model can be used to mirror the EDF+ format. The

main difference is that, with the Data Model, meta-

data pertaining to a certain biosignal is stored in the

same hierarchical node as that biosignal, thus explic-

itly deﬁning the ownership of the metadata. Addition-

ally, it is easy, with this approach, to retrieve just one

of the stored biosignals, without being necessary to

parse the entire ﬁle. This is done by directly navi-

gating to the desired hierarchical node, where all the

necessary information is stored. Note that the Data

Model requires the deﬁnition of the metadata ﬁelds

for each stored biosignal. This implies some over-

head, both in computational terms and to the user,

but it is our opinion that the ﬂexibility and more

detailed information this structure provides compen-

sates the additional workload. As stated earlier, the

Data Model encourages the deﬁnition of the metadata

ﬁelds according to each application case, and, accord-

ingly, the biosignal nodes may contain varying sets of

metadata ﬁelds.

3.3 Implementation

Having described the abstract data model, it is now

relevant to discuss how to put it into practice. To that

end, the two technologies presented below, NoSQL

databases and the Hierarchical Data Format, were

considered since they exhibit a set of properties that

meet the previously discussed requirements.

3.3.1 NoSQL Databases

As the name suggests, NoSQL is a family of Database

Management Systems (DBMS) that does not use the

SQL language, and many times does not employ the

relational model at all (Strauch, 2011). These stor-

age systems do not need ﬁxed schemas, and typically

scale horizontally. They are optimized for retrieve and

append operations, and are capable of dealing with

large amounts of data, making them ideal to solve the

problem at hand. Typical NoSQL categories include

key-value stores, Big Table implementations, docu-

ment stores and graph databases (Strauch, 2011). Of

these, the most interesting in the context of biosignals

is the document-based approach.

The basic notion of this type of DBMS is that

a document encapsulates and encodes data in some

standard format (e.g. XML, JSON). Documents are

analogous to rows in the SQL language, but they do

not need to comply all with the same schema, i.e.,

Rec o rd = {

’ He a d e r ’: {

’ Subject ’: < P ati ent ID >

’ Re c ord ID ’: < Re c ord ID >

’ Date ’: < S tart Date > + < S t art

Time >

} ,

’ Bi os i gn a l 1 ’: {

’ Duratio n ’

’ Sa m ple Rate ’

’ L abel ’: < L a b el [1] >

’ Tra n s duce r ’: < T ra n sd uc e r [1] >

’ Ph ysi cal Di m e nsion ’: < P h ys i ca l

Di m en s io n [1] >

’ Ph ysi cal Maximum ’: < Ph ysi cal

Ma x imu m [1] >

’ Ph ysi cal Minimum ’: < Ph ysi cal

Mi n imu m [1] >

’ Di g it a l Max i m u m ’: < D i gi t al

Ma x imu m [1] >

’ Di g it a l Min i m u m ’: < D i gi t al

Mi n imu m [1] >

’ A udit ’: < P re fi l te ri ng [1] >

} ,

...

’ Bi os i gn a l NS ’: {

’ Duratio n ’

’ Sa m ple Rate ’

’ L abel ’: < L a b el [ NS ] >

’ Tra n s duce r ’: < T ra n sd uc e r [ NS ] >

’ Ph ysi cal Di m e nsion ’: < P h ys i ca l

Di m en s io n [ NS ] >

’ Ph ysi cal Maximum ’: < Ph ysi cal

Ma x imu m [ NS ] >

’ Ph ysi cal Minimum ’: < Ph ysi cal

Mi n imu m [ NS ] >

’ Di g it a l Max i m u m ’: < D i gi t al

Ma x imu m [ NS >

’ Di g it a l Min i m u m ’: < D i gi t al

Mi n imu m [ NS ] >

’ A udit ’: < P re fi l te ri ng [ NS ] >

}

Figure 3: Mirroring of the EDF+ ﬁle structure onto the Data

Model, using a JSON-style notation; text between ”<>”

identiﬁes the corresponding ﬁeld in the EDF+ format.

they do not need to possess the same ﬁelds, which

ﬁts nicely the previously mentioned requirements. In

this work, the CouchDB (Anderson et al., 2010) and

MongoDB (Chodorow and Dirolf, 2010) implemen-

tations were analyzed. The reason for choosing these

two document-based NoSQL DBMSs resides in the

fact that both encode the data in JSON (BSON – bi-

nary JSON – for MongoDB), which is becoming in-

creasingly popular as a language-independent, data-

interchange format (Crockford, 2006). Table 1(c) pro-

vides a comparison of MongoDB and CouchDB.

StorageBIT-AMetadata-aware,Extensible,SemanticandHierarchicalDatabaseforBiosignals

Table 2: Implementation alternatives for the desired biosignal storage system; solution ”A4” is the proposed approach.

Type Description Code Technologies

DB Only

All data and metadata stored in the

A1.1 MongoDB

A1.2 CouchDB

File Only

All data and metadata stored in

HDF5 ﬁles

A2 HDF5

File in DB

All data and metadata stored in the

DB, but data ﬁles are attached to

documents on the DB

A3.1

MongoDB,

HDF5 via

mongoﬁles

A3.2

MongoDB,

HDF5 via

rawIO

File with

Metadata stored in DB, data stored

in ﬁles served out of a dedicated

and distributed ﬁle system

MongoDB,

HDF5

3.3.2 Hierarchical Data Format (HDF5)

The Hierarchical Data Format is a self-describing data

format designed to store and organize large amounts

of numerical data (HDF, 2010). It is supported by

many software platforms, including C/C++, Java,

Matlab and Python. The ﬁle structure is based on two

major types of objects: Datasets (multidimensional

arrays of a homogeneous type), and Groups (con-

tainer structures which can hold datasets and other

groups). This model is very similar to the philosophy

behind the design of JSON ﬁles, being completely

ﬂexible, and thus allowing the storage of annotations

and all the desired metadata. In particular, it is possi-

ble to deﬁne attributes for both the Dataset and Group

objects, storing the metadata associated with them.

Datasets can be compressed (GZIP), and can also be

extended and accessed non-sequentially, an advantage

resulting from the use of B-Trees (Bayer, 1971). It is

also possible to link to other HDF5 ﬁles. It is obvious

from this discussion, and from the comparison made

in Table 1(a), that the HDF5 ﬁle format is the one that

best covers the listed requirements, with the exception

of the querying property. Therefore, a simple query

engine was implemented to search the HDF5 ﬁles.

3.4 Implementation Alternatives

Making use of MongoDB, CouchDB and HDF5, we

implemented and evaluated the performance of the

following approaches: 1) DB Only: All data and

metadata is stored in the DB; this was done using

MongoDB (case A1.1) and CouchDB (case A1.2);

2) File Only: All data and metadata is stored in

HDF5 ﬁles (case A2); 3) File in DB: All data and

metadata is stored in the DB, but HDF5 ﬁles with

the signal data are attached to documents on the DB;

using MongoDB, two alternatives for ﬁle attachment

were tested, one using mongoﬁles

(case A3.1), the

other using rawIO

(case A3.2); and 4) File with DB:

Metadata stored in MongoDB, data stored in HDF5

ﬁles served out of a dedicated and distributed ﬁle sys-

tem (case A4). Solution A4 is the proposed approach.

This information is summarized in Table 2.

4 PERFORMANCE RESULTS

The task of evaluating a DB system is not an easy one.

It should be noted that there is no single evaluation

measure that universally classiﬁes a DB system, as

it is highly dependent on the speciﬁc application. In

this sense, the ﬁrst question that should be posed is the

question of applicability, i.e., if the system solves the

problem at hand. Secondly, the system is to be used

by people from different backgrounds and, as such, it

should be easy to use. Finally, performance measures

like speed, persistency and scalability should also be

addressed.

Taking into account the various approaches de-

scribed above, several tests were designed for exper-

imental evaluation, in order to assess their perfor-

mance. In particular, insertion, update and query op-

erations were measured, and in order to have a real-

istic test environment, a simulation setting was cre-

ated. This base setting is composed of 100 experi-

ments, 1000 subjects and 1000 records (1 per subject,

10 per experiment). For each record, a dummy biosig-

nal (all zeros) was added, with 32 channels, lasting for

900 seconds and sampled at 1024 Hz. Each element

of the signal is a 16-bit integer, which accounts for a

A command line tool included in the MongoDB distri-

bution to access GridFS, its embedded ﬁle system.

An alternative to the mongoﬁles tool using the Python

interface to handle data streams.

HEALTHINF2013-InternationalConferenceonHealthInformatics

Figure 4: Results of the Biosignal Storage test.

total size of 56.25 MB per record, or about 55GB for

the entire base set. On top of this base set, the follow-

ing performance tests were applied: a) Insertion of

metadata ﬁelds; b) Biosignal Storage; c) Biosignal

Retrieval; d) Query of the metadata ﬁelds; e) Up-

date in Place of metadata ﬁelds, without changing

the size of the ﬁeld; and f) Add Update of metadata

ﬁelds, adding a new ﬁeld.

The metadata associated with each record is com-

posed by the names of the experiment and the subject,

and a date string (ISO8601 format). The metadata

associated with each dummy biosignal comprises the

sample rate, resolution, duration, a type (i.e. ’zeros’),

and a list of channel labels.

The tests were applied in a cumulative manner,

i.e., by ﬁrst adding a number of new elements to the

base set, and then executing the tests. This was re-

peated for 10, 50, 100, 500 and 1000 new additions,

so that the ﬁnal set had 1760 experiments, 2660 sub-

jects and 2660 records, with a total size of about

170GB, taking into account all signal data and meta-

data. The performance (P) of the Insertion, Query,

Update in Place and Add Update tests was measured

in operations per second (ops), by timing how long

each operation took to run (t

), and then dividing the

number of operations (N) by the total time (Equation

1):

P =

∑

i=1

(1)

The rate (R) of the Biosignal Storage and Retrieval

operations was quantiﬁed in MB/s by measuring the

amount of data that was being added or retrieved in

each operation (d

), timing that operation (t

) and then

computing a mean data rate (Equation 2):

R =

∑

i=1

∑

i=1

(2)

The tests were run on an Ubuntu machine (v10.10,

64 bits, Intel i7 with 12 cores at 3.33GHz, 16 GB

of RAM), with the necessary code implemented in

Python (v2.6.6). Version 1.0.1 of CouchDB, version

2.0.2 of MongoDB and version 1.8.4 of HDF5 were

used. All elements were run locally in order to reduce

the variability introduced by network load.

Biosignal Storage and Retrieval rates for the var-

ious cases are shown in Figures 4 and 5. In the

case of Biosignal Storage (Figure 4), the fastest re-

sults were obtained for approaches A” and A4, with

a data rate consistently above 100MB/s. A similar

result for these two approaches was already expected,

as the underlying storage technology (HDF5) is com-

mon to them. The less performing results correspond

to solution A1.2. Figure 5 shows the results for the

Biosignal Retrieval tests. Again, the approaches A2

and A4 produce the highest reading rates, although

only while the number of signals is not too big. After

1100 records for A4 and after 1500 records for A2, all

the alternatives perform similarly, at about 100MB/s.

This reduction in performance is possibly due to the

saturation of the system buffers, as the ﬁrst operations

are extremely fast.

Results (mean and standard deviation) for the In-

sertion, Update and Query performance tests are sum-

marized in Table 3, where it is possible to observe

that the approaches using MongoDB (including the

proposed approach) outperform the other solutions

in every case, except for the Record Header Inser-

StorageBIT-AMetadata-aware,Extensible,SemanticandHierarchicalDatabaseforBiosignals

Figure 5: Results of the Biosignal Retrieval test.

Table 3: Results (µ ± σ) of the Insertion, Update and Query tests (in operations per second).

Test

Approach

A1.1 A1.2 A2 A3.1 A3.2 A4

Insertion

experiments 962.4 ± 173.7 14.1 ± 2.7 NA 930.7 ± 113.1 1086.9 ± 290.8 1002.5 ± 210.1

subjects 650.7 ± 360.0 15.6 ± 0.3 NA 821.2 ± 287.2 814.8 ± 417.3 812.9 ± 531.5

record header 220.3 ± 81.0 3.7 ± 0.02 930.1 ± 175.6 231.4 ± 103.5 289.3 ± 44.9 175.2 ± 59.6

Query

889.4 ± 50.6 5.3 ± 3.0 0.4 ± 0.2 849.0 ± 25.0 1100.1 ± 171.4 1097.7 ± 403.8

Update

in place 13100.4 ± 4966.7 19.1 ± 0.4 2235.8 ± 1061.7 10930.1 ± 6448.6 14809.4 ± 4383.2 660.3 ± 359.9

add 14346.8 ± 2973.6 19.0 ± 0.6 2797.3 ± 31.5 12907.9 ± 6205.4 17096.7 ± 4238.1 818.7 ± 35.4

tion in case A2. Note that, in this table, there are

no Experiments or Subjects Insertion results for so-

lution A2 given this solution is solely based on HDF5

ﬁles, where the creation of a new Record already in-

cludes all the metadata. Solution A1.2, employing

CouchDB, is the least performing one, in particular

for the Querying test. Indeed, there is not, at the

moment, a straightforward method the make general

queries (e.g. retrieve all documents with a speciﬁc

ﬁeld matching an input value) in CouchDB. Instead,

it uses views (in a MapReduce philosophy (Dean

and Ghemawat, 2004)), which list the documents that

match a certain criterion, but these views must be de-

ﬁned a priori, therefore limiting the scope of what is

queryable to the existing set of views. Additionally,

the interactive creation of views is not feasible, be-

cause this process is expensive to compute, as a new

index must be built, thus impacting the overall per-

formance of the system. On the other hand, Mon-

goDB’s query engine, based on memory-mapped ﬁles

and learning from the stored data, is highly efﬁcient,

and consequently provides superior performance, re-

gardless of the existence of an index. Comparing the

MongoDB solutions (A1.1, A3.1, A3.2 and A4), the

results are similar, taking into consideration the stan-

dard deviations, except for the Record Header Inser-

tion and Update tests, where approach A4 is slower,

as a result of having to insert or update the metadata

both in the MongoDB database and in the HDF5 ﬁle.

Regarding the observed errors for the Update tests,

note that, for example in case A3.2, the mean oper-

ation time is lower than 7 × 10

−5

seconds. This is a

quite small value to measure and, therefore, any small

imprecisions lead to big errors in the resulting met-

ric, which, by Equation 1, depends on the inverse of

the measured time. Overall, we believe that solution

A4 provides the best combination of properties, uni-

fying the efﬁcient storage and retrieval of biosignal

data provided by HDF5 ﬁles, with the fast and ﬂexible

Insert, Query and Update operations of MongoDB.

5 DISCUSSION AND

CONCLUSIONS

In this work, the topic of storing and accessing biosig-

nals was addressed in a metadata-aware, extensible,

semantic and hierarchical way. In particular, a data

HEALTHINF2013-InternationalConferenceonHealthInformatics

model was deﬁned and evaluated, database technolo-

gies and ﬁle formats were presented, and various im-

plementations were evaluated using a series of bench-

marking tests for biosignal storage and retrieval, data

insertion, update and querying. The results show

that, ﬁrst of all, MongoDB is, generally, faster than

CouchDB as a DBMS. This is due to the fact that

MongoDB uses memory-mapped ﬁles, keeping the

more frequently accessed information in Random Ac-

cess Memory (RAM). Regarding the operations of in-

serting and retrieving biosignal data, the most efﬁ-

cient approach is using the HDF5 ﬁle format. Based

on these results, our work proposes a hybrid solu-

tion (solution A4), which uses MongoDB as a front-

end for data access (stores metadata), and HDF5 as

the backbone for data persistency (stores synchronous

time series data and asynchronous event-based data),

as can be seen in Figure 6. In this way, metadata is

put under the spotlight, as it is by querying the meta-

data that a user of the system has access to the desired

signals, stored in an HDF5 ﬁle. Therefore, the pro-

posed system combines fast querying, insertion and

retrieval of biosignal data that is easily accessed from

a variety of systems and platforms. Using HDF5 ﬁles

also has the additional advantage of enabling the shar-

ing of individual records by conventional (ﬁle-based)

methods.

SERVER

CLIENT

MONGO DB

StorageBIT API

MERGER

UNISON

HDF5 STORAGE

.EXP A

.EXP B

.EXP C

HDF5 STORAGE

.EXP A

.EXP C

SYNC OF

HDF5 FILES

QUERY

Figure 6: Schematic structure of the StorageBIT system.

Figure 6 details the overall architecture of the sys-

tem, where a centralized machine hosts the MongoDB

server and the HDF5 storage. On the client side, the

user has access to the information by sending query

requests to the server, and a local storage of the de-

sired HDF5 ﬁles (organized by experiment) is main-

tained, improving speed of access. The HDF5 storage

on the client is kept in sync with the server with Uni-

son (Pierce and Vouillon, 2004), which automatically

propagates changes from one replica to the other. The

StorageBIT module is part of an integrated biosig-

nal acquisition, storage, visualization, annotation and

processing framework, the Biosignal Igniter Toolkit

(BIT), which is currently under development at out

research group.

The proposed system thus extends the state-of-

the-art in the ﬁeld, overcoming many of the common

problems faced by biosignal researchers, such as han-

dling the metadata, no size limits to header and meta-

data information, and seamless distributed access to

the data repositories. Due to the underlying data per-

sistency technology, our system is fully compatible

with existing standards, as the Data Model can be

shaped to mirror any commonly used ﬁeld structure.

Further work includes the speciﬁcation of the for-

mat for the Audit ﬁeld, which is intended to function

as a detailed history of the various processing steps

applied to the stored signals.

ACKNOWLEDGEMENTS

This work was partially funded by Fundac¸

para a Ci

encia e Tecnologia (FCT) under

grants SFRH/BD/65248/2009, SFRH/PRO-

TEC/49512/2009, PTDC/EIA-CCO/103230/2008

and PTDC/DES/103178/2008. The authors would

also like to thank the Institute for Systems and

Technologies of Information, Control and Communi-

cation (INSTICC), the graphic designer Andr

e Lista,

Prof. Pedro Oliveira, and the Instituto Superior de

Educac¸

ao e Ci

encias (ISEC), for their support to this

work.

REFERENCES

Anderson, J. C., Lehnardt, J., and Slater, N. (2010).

CouchDB: The Deﬁnitive Guide Time to Relax.

O’Reilly Media, Inc., 1st edition.

Bayer, R. (1971). Binary b-trees for virtual memory. In Pro-

ceedings of the 1971 ACM SIGFIDET (now SIGMOD)

Workshop on Data Description, Access and Control,

SIGFIDET ’71, pages 219–235, New York, NY, USA.

ACM.

Brooks, D. (2009). Extensible biosignal metadata a model

for physiological time-series data. In Eng. in Medicine

and Biology Society, 2009. EMBC 2009. Annual Inter-

national Conference of the IEEE, pages 3881 –3884.

Brooks, D., Hunter, P., Smaill, B., and Titchener, M. (2011).

BiosignalML - a meta-model for biosignals. In Eng.

StorageBIT-AMetadata-aware,Extensible,SemanticandHierarchicalDatabaseforBiosignals

in Medicine and Biology Society,EMBC, 2011 Annual

International Conference of the IEEE, pages 5670 –

5673.

Chodorow, K. and Dirolf, M. (2010). MongoDB: The

Deﬁnitive Guide. O’Reilly Media.

Crockford, D. (2006). The application/json media type for

JavaScript Object Notation (JSON). Network Work-

ing Group, json.org/.

Dean, J. and Ghemawat, S. (2004). MapReduce: simpliﬁed

data processing on large clusters. In Proceedings of

the 6th conference on Symposium on Opearting Sys-

tems Design & Implementation - Volume 6, OSDI’04,

pages 10–10, Berkeley, CA, USA. USENIX Associa-

tion.

Goldberger, A. L., Amaral, L. A. N., Glass, L., Hausdorff,

J. M., Ivanov, P. C., Mark, R. G., Mietus, J. E., Moody,

G. B., Peng, C.-K., and Stanley, H. E. (2000). Phys-

ioBank, PhysioToolkit, and PhysioNet: Components

of a new research resource for complex physiologic

signals. Circulation, 101(23):e215–e220.

HDF (2010). The HDF group. Hierarchical data format ver-

sion 5, 2000-2010. www.hdfgroup.org/HDF5.

Hellmann, G., Kuhn, M., Prosch, M., and Spreng, M.

(1996). Extensible biosignal (EBS) ﬁle format: simple

method for eeg data exchange. Electroencephalogra-

phy and Clinical Neurophysiology, 99(5):426 – 431.

Jain, A., Duin, R., and Mao, J. (2000). Statistical pattern

recognition: A review. Pattern Analysis and Machine

Intelligence, IEEE Transactions on, 22(1):4–37.

Kemp, B. and Olivan, J. (2003). European data format

‘plus’ (EDF+), an edf alike standard format for the

exchange of physiological data. Clinical Neurophysi-

ology, 114(9):1755 – 1761.

Kokkinaki, A., Chouvarda, I., and Maglaveras, N.

(2008). An ontology-based approach facilitating uni-

ﬁed querying of biosignals and patient records. In

Eng. in Medicine and Biology Society, 2008. EMBS

2008. 30th Annual International Conference of the

IEEE, pages 2861 –2864.

Kvedar, J. (2011). A physician’s perspective on self-

tracking. MIT - Technology Review.

Lovell, N., Magrabi, F., Celler, B., Huynh, K., and Gars-

den, H. (2001). Web-based acquisition, storage, and

retrieval of biomedical signals. Eng. in Medicine and

Biology Magazine, IEEE, 20(3):38 –44.

McGuinness, D. L. (2003). Ontologies come of age. Spin-

ning the Semantic Web: Bringing the World Wide Web

to Its Full Potential.

MFER (2003). Medical waveform format encoding rules.

http://ecg.heart.or.jp/En/Index.htm.

Penzel, T., Kemp, B., Klosch, G., Schlogl, A., Hasan, J.,

Varri, A., and Korhonen, I. (2001). Acquisition of

biomedical signals databases. Eng. in Medicine and

Biology Magazine, IEEE, 20(3):25 –32.

Pierce, B. C. and Vouillon, J. (2004). What’s in Unison? A

formal speciﬁcation and reference implementation of

a ﬁle synchronizer. Technical Report MS-CIS-03-36,

Dept. of Computer and Information Science, Univer-

sity of Pennsylvania.

Strauch, C. (2011). NoSQL databases. Technical report,

Stuttgart Media University.

Varri, A., Kemp, B., Penzel, T., and Schlogl, A. (2001).

Standards for biomedical signal databases. Eng. in

Medicine and Biology Magazine, IEEE, 20(3):33 –37.

Vaseghi, S. V. (2006). Advanced Digital Signal Processing

and Noise Reduction. Wiley, 3rd edition.

Wal, T. V. (2007). Folksonomy coinage and deﬁnition.

http://vanderwal.net/folksonomy.html.

HEALTHINF2013-InternationalConferenceonHealthInformatics