validation. We train 5 different Cellpose models for
the 5 splits. Each Cellpose model is trained start-
ing from the pre-trained model ’cyto’ provided by the
Cellpose package. Then we use three different infer-
ence diameter values (sampled from the training set
to represent small, medium and large sized LNPs) to
record the performance of the transfer-learned model
on our test dataset before applying the pipeline.
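The 5-split cross-validation described above could be sketched as follows. This is a minimal illustration, assuming the 38 images are shuffled once and distributed into near-equal folds; the helper name `five_fold_splits`, the seed, and the shuffling strategy are assumptions, not details taken from the paper.

```python
import random

def five_fold_splits(n_images, n_folds=5, seed=0):
    """Return a list of (train_idx, test_idx) pairs, one per fold."""
    indices = list(range(n_images))
    random.Random(seed).shuffle(indices)  # seed is an arbitrary choice
    # Deal images out round-robin so fold sizes differ by at most one.
    folds = [indices[k::n_folds] for k in range(n_folds)]
    splits = []
    for k in range(n_folds):
        test_idx = sorted(folds[k])
        train_idx = sorted(
            i for j, fold in enumerate(folds) if j != k for i in fold
        )
        splits.append((train_idx, test_idx))
    return splits

splits = five_fold_splits(38)
for fold, (train_idx, test_idx) in enumerate(splits):
    print(f"fold {fold}: {len(train_idx)} train / {len(test_idx)} test images")
```

With 38 images, each fold holds out 7 or 8 images, which corresponds to roughly 80 % for training and 20 % for testing.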
Next, we perform the steps in our pipeline and merge
down to different threshold values, ranging from a
high of 20 to a low of 0. We record how the
performance varies and compare it to the values
obtained before post-processing. The results are
shown in Fig. 8, which reports the average precision
and recall across the 5 folds for Cellpose before and
after applying the pipeline.
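As a rough sketch of how per-fold precision and recall can be averaged, the snippet below uses the standard definitions. The counts are placeholders and NOT results from the paper, and the rule for matching a prediction to a ground-truth particle is not specified here, so it is left outside the sketch.

```python
def precision_recall(tp, fp, fn):
    """Standard definitions: precision = TP/(TP+FP), recall = TP/(TP+FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Placeholder (tp, fp, fn) counts for 5 folds, invented for illustration only.
fold_counts = [(60, 20, 40), (55, 25, 45), (70, 30, 30), (65, 15, 35), (50, 25, 50)]

per_fold = [precision_recall(*counts) for counts in fold_counts]
mean_precision = sum(p for p, _ in per_fold) / len(per_fold)
mean_recall = sum(r for _, r in per_fold) / len(per_fold)

print(f"mean precision = {mean_precision:.2f}, mean recall = {mean_recall:.2f}")
```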
5 DISCUSSION
We evaluate standard benchmarks in cryoEM image
analysis and closely related areas in the context of
LNP localization, to test how well they handle hetero-
geneity in particle size distribution. To do so, we had
to compare different annotation and prediction styles.
We have a small dataset of 38 images of which
approximately 80 % were used for training and 20 %
for testing. Although the number of images is
relatively low, the number of particles used for
training is on par with other benchmarks used in
cryoEM analysis. We use an average of 1641 particles to train our
model. The authors of crYOLO state that 200–2500
particles are sufficient to train their model (Wagner
et al., 2019). Similarly, Topaz was evaluated on parti-
cle picking of the Toll Receptor protein using a model
that was trained on 686 labeled particles (Bepler et al.,
2019). Of the benchmarks tested, Cellpose
outperformed both Topaz and crYOLO. However, applying
Cellpose to our data is not straightforward because it
requires the median particle diameter at inference, a
value we do not know a priori. Moreover, the median
diameter might not be the right choice if the particle
size distribution is skewed towards smaller or larger
particles.
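To illustrate why a single median diameter can misrepresent a skewed population, consider the following sketch. The diameter values are invented for illustration (in pixels) and are not measurements from the dataset.

```python
from statistics import mean, median

# Hypothetical particle diameters in pixels (made up for illustration):
# many small particles and a few large ones, i.e. a right-skewed distribution.
diameters = [20] * 6 + [25] * 5 + [30] * 4 + [60, 90, 120]

d_median = median(diameters)
d_mean = mean(diameters)

# Ratio between the largest particle and the median diameter: a rough
# indication of how far a single inference diameter sits from the tail.
tail_ratio = max(diameters) / d_median

print(f"median = {d_median}, mean = {d_mean:.1f}, max/median = {tail_ratio:.1f}")
```

In this toy sample the largest particle is several times wider than the median, so an inference diameter fixed at the median would be a poor match for the tail of the distribution.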
To standardize inference when using Cellpose for
segmentation of non-uniform particles, we introduce
an optimization pipeline that allows the user to
optimize specifically for precision or recall. The goal
of the optimization pipeline is to remove the guesswork
from selecting the correct inference diameter when
there is a heterogeneous size distribution of particles.
In Fig. 8, our cross-validation results show that the
performance of Cellpose alone, without our pipeline,
is not predictable. The horizontal lines represent
precision and recall values when using Cellpose with
different inference diameters. Across different folds,
no consistent relationship can be observed between the
inference diameter used and the corresponding
precision and recall.
In contrast, if our optimization pipeline is used,
the precision value trends downwards from a high to
a low threshold while the recall value trends upwards.
The initial threshold, from which the merging process
begins, can be set by the user. It depends on how
densely the inference diameters were sampled during
multi-inference. We sampled diameters every ∼5 pixels.
We found that a threshold above 20 returns a blank
mask or very few particles; therefore, we used 20 as
a starting point.
If denser sampling is performed, such as every 2 pix-
els, then the user should consider setting the initial
threshold to a value higher than 20. For sparser
sampling, the initial threshold can be set lower
than 20. A high initial threshold will simply result in
a few masks that are empty but should not affect the
merging results. The final threshold value chosen for
merging, in contrast, does affect the results.
Choosing a very low threshold such as 0 or 1 optimizes
for recall, whereas choosing a higher threshold such
as 15 or 20 optimizes for precision. To strike a
balance between the two, we can choose a value in the
middle such as 5 or 8. The precision
values for higher thresholds are sometimes low, either
because the starting threshold was too high and not
enough particles were captured, or because more false
positives than true positives were picked up at that
stage. Since the total number of predictions is lower
at higher confidence thresholds, we sampled the
merging thresholds sparsely at the start and increased
the density as we moved lower. The intermediate
threshold values can also be set by the user.
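The effect of lowering the threshold can be sketched with a toy example. This is not the pipeline's actual implementation: it assumes, for illustration only, that the threshold acts on a per-pixel count of how many inference diameters predicted foreground, and that a pixel is kept when its count exceeds the threshold. The function names, the threshold schedule, and the toy masks are all hypothetical, and instance-level merging of particles is omitted.

```python
def vote_map(binary_masks):
    """Count, per pixel, how many inference runs marked it as foreground."""
    height, width = len(binary_masks[0]), len(binary_masks[0][0])
    return [
        [sum(1 for mask in binary_masks if mask[r][c]) for c in range(width)]
        for r in range(height)
    ]

def mask_at(votes, threshold):
    """Keep pixels whose vote count exceeds the threshold (assumed rule)."""
    return [[v > threshold for v in row] for row in votes]

# Illustrative schedule: sparse at high thresholds, denser near zero.
schedule = [20, 15, 10, 8, 5, 3, 2, 1, 0]

# Toy 1x4 "image" with three inference runs (made-up values).
runs = [
    [[1, 1, 0, 0]],
    [[1, 0, 0, 0]],
    [[1, 1, 1, 0]],
]

votes = vote_map(runs)
# Foreground area at each threshold, from the highest to the lowest.
areas = [sum(map(sum, mask_at(votes, t))) for t in schedule]
print(list(zip(schedule, areas)))
```

Under these assumptions, thresholds above the number of inference runs yield empty masks, and the foreground can only grow as the threshold is lowered, which is consistent with recall rising and precision falling towards low thresholds.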
The correlation between threshold and precision
and recall provides clear intuition to the user as to
which value to use based on their use case. Fig. 8
shows that, on average, merging down to the lowest
threshold improves recall by a relative 6.8 %, from
0.59 (when using just Cellpose with the median
diameter for inference) to 0.63. Similarly, the
precision improves by a relative 4.1 %, from 0.73
(when using just Cellpose with the median diameter
for inference) to 0.76 (when using the pipeline).
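The percentages quoted above are relative improvements and can be checked from the rounded averages. The helper name `relative_gain` is illustrative; the reported figures may have been computed from unrounded values.

```python
def relative_gain(before, after):
    """Relative change of `after` with respect to `before`, in percent."""
    return 100.0 * (after - before) / before

recall_gain = relative_gain(0.59, 0.63)
precision_gain = relative_gain(0.73, 0.76)

print(f"recall: +{recall_gain:.1f} %, precision: +{precision_gain:.1f} %")
```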
Our workflow was used by a cryoEM expert to
semi-automate boundary annotation of LNPs in cry-
oEM images. He was able to generate annotations for
5X more particles per hour when using the pipeline.
In the future, we hope to leverage our large repository
of unlabeled data through semi-supervised learning.
We also hope to extend our analysis to localize the
mRNA which resides within LNPs.
LipoPose: Adapting Cellpose to Lipid Nanoparticle Segmentation