output and classify the type of anomaly based on that
shape.
The automated method for flagging anomalous
output is also based on several channels considered
jointly, rather than on aligning the channels by hand
and judging the presence of a PVA from a disjoint
evaluation of each, as is necessary when conducting
the inspection manually.
The four primary channels used in this work are the
pressure mask (pmask), the abdominal (abdo) and
thoracic (thor) respiratory inductance bands, and the flow
output from the NIV device (flow-tx). This multi-channel
approach is particularly useful in cases such as
ineffective efforts, which would not necessarily appear
on a single channel but could still be present.
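As an illustration (this is not the project's actual detector), a simple numpy sketch shows why joint evaluation matters: evidence is aggregated across all four channels, so an event too weak to cross any per-channel threshold can still be detected. The channel labels follow the text; the scoring rule itself is an assumption for illustration only.

```python
import numpy as np

# Illustrative only: aggregate normalised deviations across the four
# channels named in the text, so weak but consistent evidence adds up.
CHANNELS = ["pmask", "abdo", "thor", "flow-tx"]

def joint_anomaly_score(window):
    """window: dict mapping channel name -> 1-D numpy array of samples.
    Returns a single score aggregated across all channels."""
    scores = []
    for name in CHANNELS:
        x = window[name]
        # largest deviation from the window mean, normalised by spread
        scores.append(np.abs(x - x.mean()).max() / (x.std() + 1e-9))
    # mean rather than max: several mild deviations can together
    # exceed a threshold that no single channel would reach
    return float(np.mean(scores))
```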
3.2 Software Implementation
To integrate the output of the machine-learning
algorithm, an open-source format – the European
Data Format (EDF) specification (www.edfplus.info)
– has been chosen for data representation and
manipulation, not only because of its standardised
structure and well-supported open-source
community, but also because it allows the output to be
ported across different platforms. This open format
supports the extraction, in a standardised structure,
of both scored labels and free-text comments that
have been added manually into the integrated bedside
system. It thus enforces a degree of structure on the
contextual information recorded during a monitored
patient's sleep, which is usually loosely structured
and difficult to report in a standard way. Both scored
labels and free-text annotations also present a
heterogeneity challenge, in that there is no
standardised input approach; this effect is amplified
when more than one clinician is involved in data
entry.
Therefore, the first step in producing the EDF file
with an additional ML-output channel is to extract the
original EDF file from the integrated bedside clinical
monitoring system, in this case CompuMedics, with
one EDF file per patient per stay. The metadata
describing the annotations accompanying that patient
stay is captured in the associated XML descriptor
file.
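The annotations in such a descriptor can be pulled out with the standard library. The element and attribute names below are assumptions for illustration – the actual CompuMedics XML schema is not given in the text:

```python
import xml.etree.ElementTree as ET

# Hypothetical descriptor layout; the real schema may differ.
SAMPLE_XML = """
<PatientStay>
  <Annotations>
    <Annotation onset="123.5" duration="4.0">Ineffective effort</Annotation>
    <Annotation onset="410.0" duration="2.5">Mask leak noted</Annotation>
  </Annotations>
</PatientStay>
"""

def read_annotations(xml_text):
    """Extract (onset_seconds, duration_seconds, text) tuples."""
    root = ET.fromstring(xml_text)
    out = []
    for ann in root.iter("Annotation"):
        out.append((float(ann.get("onset")),
                    float(ann.get("duration")),
                    (ann.text or "").strip()))
    return out
```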
Once the original EDF has been extracted, the data
integrity of that file is checked in full and its contents
are written to a newly created EDF file. This new file is
composed of the original physiological channels
chosen by the user, with the addition of new channels
containing the ML-generated output.
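The copy-and-extend step can be sketched as follows. The real pipeline reads and writes EDF via pyedflib; here the channels are plain numpy arrays keyed by label so the logic is visible without the file I/O, and the channel label "ML-P" is taken from the units named later in the text:

```python
import numpy as np

def build_output_channels(original, ml_output, ml_label="ML-P"):
    """original: dict of label -> samples for the user-chosen channels.
    Returns a new channel set with the ML-generated channel appended;
    the original channel set is left untouched."""
    new_file = dict(original)            # copy the selected channels
    assert ml_label not in new_file      # do not clobber a real signal
    new_file[ml_label] = np.asarray(ml_output, dtype=float)
    return new_file
```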
This is achieved using the Python library pyedflib,
which is a fork of the EDFlib library
(www.teuniz.net). These libraries are used to read the
EDF file's properties and values, including the number of
signals, channel indexes, sample frequency, and
number of data records. The physiological channels
are then read via this library and pre-processed in
preparation for the ML algorithm. For ease of
persistent data storage, and to reduce the burden
on volatile memory, intermediate files
are written as part of these pre-processing steps.
These are stored in the Apache Parquet file format
(parquet.apache.org) – an open-source, column-oriented
data file format, which uses built-in
compression for efficient data storage and retrieval.
The parquet files are then fed into the
ML algorithm. The operation of the algorithm
involves many dependencies in both software and
hardware, including:
• A memory-efficient version of the Anaconda
Python environment, known as MambaForge
(mamba.readthedocs.io), which provides a
setting that maximises the available underlying
memory to run the memory-intensive ML
algorithm.
• A library called Signatory that allows the
calculation of a “signature transform”, an
operation roughly analogous to a Fourier
transform that extracts information on the order
and area of a given data stream.
• Underlying GPU acceleration at the hardware
level, requiring the activation of an NVIDIA
processor (if available). In this project the Azure
cloud resource provides a VM within the “NC
series” equipped with an NVIDIA GeForce RTX
3090 chip with 24 GiB of memory.
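To make the “order and area” intuition concrete, a minimal numpy sketch of a depth-2 signature for a piecewise-linear stream follows. This is an illustration of the transform, not the Signatory library itself; the update rule is Chen's identity for appending one linear segment:

```python
import numpy as np

def signature_depth2(path):
    """Depth-2 signature of a piecewise-linear path.
    path: array of shape (length, channels)."""
    d = path.shape[1]
    s1 = np.zeros(d)        # level 1: total increment of the stream
    s2 = np.zeros((d, d))   # level 2: order/area information
    for delta in np.diff(path, axis=0):
        # Chen's identity for concatenating one linear segment
        s2 += np.outer(s1, delta) + np.outer(delta, delta) / 2.0
        s1 += delta
    return s1, s2
```

Two paths with the same endpoints (“right then up” versus “up then right”) share the level-1 term but differ at level 2 – the antisymmetric part of `s2` is the signed Lévy area, which is exactly the order-sensitive information the text refers to.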
Using this software stack, once the algorithm's output is
fully computed, it is stored in a multi-dimensional
numpy array, a type on which the EDF file format
libraries rely heavily for functional operation. An
anomaly list is subsequently generated from the
numpy array, and this list is iterated over to produce a
subset of values where the ML metric has exceeded
a user-supplied threshold. This new subset
array is written to the same output EDF file in the
form of point annotations.
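The thresholding step can be sketched as follows (the function name and argument layout are illustrative; the text specifies only that values exceeding a user-supplied threshold are kept):

```python
import numpy as np

def anomalies_above_threshold(ml_values, sample_freq_hz, threshold):
    """ml_values: 1-D numpy array of ML metric values at sample_freq_hz.
    Returns (time_seconds, value) pairs where the metric exceeds the
    user-supplied threshold, ready to be written as point annotations."""
    times = np.arange(len(ml_values)) / sample_freq_hz
    mask = ml_values > threshold
    return list(zip(times[mask].tolist(), ml_values[mask].tolist()))
```

Each surviving pair maps naturally onto an EDF point annotation (an onset with zero duration and a description), which is the form the output file uses.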
In terms of timing, each down-sampled
window produces a value
(measured in the arbitrary units of “ML-P”), along
with a timing value. This timing value
can be configured to sit at any point in the window:
the beginning (value 0), the end (value 1), the middle
(value 0.5), or anywhere in between. Due to the down-
sampling of the output by a factor of 16, the ML
output is rendered at a sample frequency of 2 Hz,
when combined with the pmask output in the final