Figure 1: Flow cytometry data examples. (a) Histogram of
the number of cells measured at different fluorescence
intensity values for the CD154 marker. (b) Scatter plot of
the fluorescence intensity values for the CD3 versus the
CD154 marker, used for identifying smaller populations,
with the quadrant markers demonstrating that most cells
have a low response to both CD154 and CD3.
to measure granularity and membrane roughness
(World Health Organization, 2009). Different
fluorescence molecules or “markers” are used to label
particular types of cells to improve the identification
and quantification of different populations and sub-
populations. The standard identification system for
markers is referred to as Cluster of Differentiation
(CD).
Flow cytometry data is multidimensional, but it is
generally displayed one or two parameters at a time
(Moloney and Shreffler, 2008). A single parameter can
be displayed as a histogram, with the parameter value
on the x-axis and the frequency (number) of cells on
the y-axis (Figure 1a). Two parameters are displayed
as a scatter plot, in which each point represents a cell
as an (x, y) pair of the values of the two parameters
(Figure 1b). Up to 50 cell parameters can be
determined (Lee et al., 2017), with the number of
features dependent on the flow cytometer and
experimental design. Viewing the entire dataset at
once is therefore complex, because each plot shows
only one or two of these parameters.
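As a minimal sketch of these two views, the snippet below computes the per-bin cell counts behind a one-parameter histogram and the (x, y) pairs behind a two-parameter scatter plot. The event values, the bin count, the intensity range, and the helper name `histogram_counts` are illustrative assumptions, not measurements or code from this work.

```python
def histogram_counts(values, n_bins, lo, hi):
    """Count events per equal-width bin over [lo, hi]; values outside are ignored."""
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        if lo <= v <= hi:
            # Clamp so that v == hi falls into the last bin
            idx = min(int((v - lo) / width), n_bins - 1)
            counts[idx] += 1
    return counts

# Hypothetical events: one row per cell, columns are (CD3, CD154) intensities
events = [(0.5, 0.2), (1.5, 0.4), (2.5, 3.1), (0.7, 0.9), (3.9, 3.5)]

# One parameter: frequency of cells per CD154 intensity bin (histogram view)
cd154 = [e[1] for e in events]
hist = histogram_counts(cd154, n_bins=4, lo=0.0, hi=4.0)

# Two parameters: each cell becomes an (x, y) point (scatter-plot view)
pairs = [(cd3, cd154_value) for cd3, cd154_value in events]
```

With more parameters per cell, each additional pair of columns needs its own scatter plot, which is why inspecting the whole dataset by eye quickly becomes laborious.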
Figure 2: Manual gating examples used to visually identify
populations. (a) Lines drawn in 1 dimension (histogram).
(b) Polylines drawn in 2 dimensions (scatter plot).
After the data has been acquired, an expert operator
identifies the populations, a step known as gating.
Gating is the process of identifying cells by drawing
shapes around populations (Bashashati and Brinkman,
2009), as shown in Figure 2. Before starting the
analysis, the expert needs to know the characteristics
of the cells of interest and of the populations and
sub-populations of cells.
Manual gating is, therefore, highly subjective and
time consuming, and machine learning has been
proposed to support this process (Lo et al., 2008).
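To make the idea of a drawn gate concrete, the sketch below treats a 2-dimensional gate as a closed polygon and keeps the events that fall inside it, using a standard ray-casting point-in-polygon test. The gate vertices, the event values, and the function names are hypothetical; this illustrates manual gating in general, not the software used by the operators.

```python
def in_polygon_gate(point, polygon):
    """Ray-casting test: True if the point lies inside the closed polygon."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x-coordinate at which the edge crosses that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def apply_gate(events, polygon):
    """Return the events (cells) that fall inside the drawn gate."""
    return [e for e in events if in_polygon_gate(e, polygon)]

# Hypothetical square gate drawn on a two-parameter scatter plot
gate = [(1.0, 1.0), (3.0, 1.0), (3.0, 3.0), (1.0, 3.0)]
events = [(2.0, 2.0), (0.5, 0.5), (2.5, 1.5), (3.5, 2.0)]
gated = apply_gate(events, gate)
```

The result depends entirely on where the operator places the vertices, which is the source of the subjectivity noted above.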
FCM data is so large and complex that it is
difficult to analyse without computational tools.
There are three main problems in FCM analysis.
Firstly, manual gating (identifying the cells of
interest) is highly subjective (Lo et al., 2008).
Secondly, the number of key events is sometimes very
low (Groeneveld-Krentz et al., 2016), which makes
them harder to detect and may result in false
positives. Thirdly, manual gating is a time-consuming
process (Rahim et al., 2018), especially when the
numbers of parameters and cells are large. Although
some applications have been developed to help
clinical experts, flow cytometry data analysis
applications still have these limitations. This paper
presents the application of machine learning
techniques to implement a novel automated gating
method that can provide appropriate clustering of
cells in blood samples.
2 METHOD
Ye and Ho (2018) proposed a state-of-the-art
automated gating technique, FlowGrid, and claimed
higher accuracy and better time efficiency compared
with flowPeaks (Ge and Sealfon, 2012), FlowSOM
(Van Gassen et al., 2015), and FLOCK (Qian et al.,
2010). However, FlowGrid still requires too many
user-defined parameters.
The method proposed here aims to improve the
performance of FlowGrid by reducing both the
processing time and the number of user-defined
parameters. This improved method, the FLOPTICS
algorithm, begins by partitioning the data into
equal-sized grid cells ('bins') along each dimension;
only the non-empty bins are then processed as data
points. An example of partitioning 2-dimensional data
in this way is shown in Figure 3. Partitioning data is
not appropriate for low density datasets, but FCM data
is always high density (as can be seen in Figures 1
and 2), so the accuracy results of gating are
acceptable and the run time is faster than many
state-of-the-art techniques.
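The partitioning step can be sketched as follows. This is a minimal illustration under stated assumptions rather than the FLOPTICS implementation: the number of bins per dimension (`n_bins`), the min-max range used to size the bins, and the function name `partition_into_bins` are all hypothetical choices made for the example.

```python
from collections import Counter

def partition_into_bins(points, n_bins):
    """Assign each point to an equal-width bin per dimension.

    Returns a Counter mapping bin index tuples to the number of points
    they contain, so only non-empty bins are ever stored.
    """
    dims = len(points[0])
    lows = [min(p[d] for p in points) for d in range(dims)]
    highs = [max(p[d] for p in points) for d in range(dims)]
    bins = Counter()
    for p in points:
        key = []
        for d in range(dims):
            span = highs[d] - lows[d]
            if span == 0:
                idx = 0  # all values identical in this dimension
            else:
                # Clamp so the maximum value falls into the last bin
                idx = min(int((p[d] - lows[d]) / span * n_bins), n_bins - 1)
            key.append(idx)
        bins[tuple(key)] += 1
    return bins

# Hypothetical 2-dimensional events forming two dense groups
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (3.9, 4.0), (4.0, 3.8)]
bins = partition_into_bins(points, n_bins=4)
non_empty = len(bins)   # bins that would be passed on as data points
possible = 4 ** 2       # total bins in the 4 x 4 grid
```

In this toy case only 2 of the 16 possible bins are occupied, so a later clustering stage would handle 2 items instead of 5 events; on dense data with many events per bin the reduction is correspondingly larger.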
2.1 DBSCAN
Density-Based Spatial Clustering of Applications
with Noise (DBSCAN) was proposed by Ester et al.
(1996).
DBSCAN is a density-based algorithm for