model retraining without specific data points, several
ideas have emerged on how to legally demonstrate
that the information has been removed from the DNN.
The strongest guarantees come from the
mathematical field of differential privacy (DP). These
techniques inject carefully calibrated noise during the training process and track the cumulative privacy loss. The noise both restricts the amount of information the model can learn from any single point and acts as a regularization term, helping the model generalize to new data. However, because the DP mechanism is applied to every training point, the model often suffers a significant loss in performance, potentially rendering it unusable.
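To make the mechanism concrete, the sketch below shows a DP-SGD-style update that clips each example's gradient and adds Gaussian noise before the weight step. It is our own illustration, not part of the redaction technique introduced later; the clipping norm and noise multiplier are placeholder values, and the privacy accounting that a library such as Opacus would perform is omitted.

```python
import torch

def dp_sgd_step(model, loss_fn, batch, lr=0.05, clip_norm=1.0, noise_mult=1.1):
    # Illustrative DP-SGD-style update; clip_norm and noise_mult are placeholders.
    xs, ys = batch
    summed = [torch.zeros_like(p) for p in model.parameters()]
    for x, y in zip(xs, ys):
        model.zero_grad()
        loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0)).backward()
        grads = [p.grad.detach().clone() for p in model.parameters()]
        # Clip the per-example gradient to bound any single point's influence.
        total_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        scale = min(1.0, clip_norm / (float(total_norm) + 1e-12))
        for s, g in zip(summed, grads):
            s.add_(g, alpha=scale)
    with torch.no_grad():
        for p, s in zip(model.parameters(), summed):
            # Gaussian noise calibrated to the clipping norm provides the privacy guarantee.
            noise = torch.normal(0.0, noise_mult * clip_norm, size=s.shape)
            p.add_((s + noise) / len(xs), alpha=-lr)
```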
Liu (Liu and Tsaftaris, 2020) introduces the concept of applying statistical distributional tests after model training to determine whether a model has forgotten the information related to a set of points. The approach hinges on having enough new data to train a second model to a similar task accuracy, after which similarity measures between the two models' output distributions can be computed. Such a test would be used by an independent auditor to assess compliance. While effective, it more directly assesses whether the data was used in model training at all.
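As a sketch of such an audit (our illustration; the particular statistic is an assumption rather than the authors' exact procedure), an auditor could compare the deployed model's confidence distribution on the queried points against that of a comparison model trained on fresh data, for example with a two-sample Kolmogorov-Smirnov test:

```python
import numpy as np
from scipy.stats import ks_2samp

def audit_forgotten(target_probs, reference_probs, alpha=0.05):
    # target_probs, reference_probs: (n_points, n_classes) softmax outputs of the
    # deployed model and of a comparison model trained without the queried data.
    target_conf = np.asarray(target_probs).max(axis=1)
    reference_conf = np.asarray(reference_probs).max(axis=1)
    stat, p_value = ks_2samp(target_conf, reference_conf)
    # A large p-value means the auditor cannot statistically distinguish the deployed
    # model's behaviour on these points from that of a model never trained on them.
    return {"ks_statistic": stat,
            "p_value": p_value,
            "consistent_with_forgotten": p_value > alpha}
```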
Chen (Chen et al., 2020) introduces the idea of explicitly leveraging the MI attack itself to directly measure how much a point's privacy has been degraded. Chen also introduces two privacy metrics that measure the difference in the membership inference confidence levels of a target point between two models.
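A minimal version of such a metric (our sketch; Chen et al.'s exact definitions differ and also involve shadow-model calibration) is simply the drop in the attack's membership confidence for the target point between the two models:

```python
def mi_confidence_drop(attack_model, features_before, features_after):
    # attack_model: any binary classifier with predict_proba (e.g. from scikit-learn)
    # trained to output P(member) from a target model's posterior features.
    p_before = attack_model.predict_proba(features_before)[:, 1]  # P(member) before redaction
    p_after = attack_model.predict_proba(features_after)[:, 1]    # P(member) after redaction
    return p_before - p_after  # positive values: the point now looks less like a member
```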
We agree with this approach; however, they again rely on model retraining and shadow models to compute this statistic. In our work, we advance their approach in a key way that will support operational deployments of large, distributed DNNs. Our approach leverages incremental retraining of the target model; it does not rely on full retraining of either the deployed model or of a new model for statistical comparisons. With this redaction technique, data owners can evolve a model and alter a point's attack confidence to a desired level within a ranked list of possible training points. It is also possible to make it appear, with high confidence, that the point was not used to train the deployed model, even when evaluated against many other membership inference attack models.
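For illustration, this compliance view can be framed as asking where the redacted point falls when candidate training points are ranked by MI attack confidence; the helper below is a hypothetical sketch and not part of our pipeline.

```python
import numpy as np

def membership_rank(attack_confidences, target_index):
    # attack_confidences: MI attack P(member) scores over candidate training points.
    # After redaction we want the target point to sit deep in the non-member end.
    order = np.argsort(attack_confidences)[::-1]            # most member-like first
    return int(np.where(order == target_index)[0][0]) + 1   # 1-based rank of the target
```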
Note that we do not use the MI attack models other than as a compliance mechanism. That is, we do not use the loss or any other information from the attack models during our retraining optimization. The advantage is that this makes the redactions less dependent upon any specific attack model and more resilient to other types of attacks.
Likewise, we train attack models only to evaluate the effectiveness of the Class Clown technique. Our results show that reducing attack confidence against one attack model reduces confidence across all attack models. Such an evaluation step is not necessary in operational settings.
2 CLASS CLOWN: SURGICAL
DATA EXCISION THROUGH
LABEL POISONING DURING
INCREMENTAL RETRAINING
It remains an open question exactly how deep neural networks store and leak privacy information about specific data points. However, all of the attacks rely upon observing shifts in the output given known shifts in the input. For the vast majority of attacks, this means exploiting shifts in the output confidence vectors. The easiest case for an attacker is when there is no overlap between the output distributions on training data and on new data, as with a highly overfit model, since the two can be readily differentiated.
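As a deliberately simplistic illustration of this principle (our sketch, not any specific published attack), a confidence-threshold membership test flags points on which the model is unusually confident as likely training members:

```python
import numpy as np

def threshold_mi_attack(softmax_outputs, threshold=0.9):
    # softmax_outputs: (n_points, n_classes) confidence vectors from the target model.
    # Points receiving unusually high top-class confidence are predicted to be members;
    # in practice the threshold would be calibrated on data known to be non-members.
    return np.asarray(softmax_outputs).max(axis=1) >= threshold
```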
Even Shokri's original paper indicated that restricting the model output to the predicted label alone is not enough to prevent this attack. Misclassified predictions, and the differences in how points are misclassified, can be exploited as well, as highlighted in a recent label-only attack (Choquette-Choo et al., 2020).
These shifts in output are the result of many
aggregated computations across the network’s layers
that ultimately define the class decision boundary in
the embedded loss space. However, in the vicinity of
a point, there is a relationship between model
confidence and the distance to its decision boundary.
We leverage this relationship and seek to alter the embedded loss space of the target model only in the vicinity of the points that we need to redact. By altering a point's local decision boundary, we can shift the target model's confidence outputs, thereby tricking any membership inference attack model into believing that the point was not used in training. We use a mechanism that does so gently, without significantly affecting the model's accuracy or its network weights.
We achieve this incrementally, starting from the existing deployed (target) model. For simplicity, we first develop the technique in the context of a single point and then extend it to multiple redaction points via an arrival queue representing irregular data redaction requests, as sketched below.
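A minimal sketch of how such a queue might be serviced follows; `redact_fn` is a hypothetical stand-in for the incremental label-poisoning retraining described next, and the loop structure is illustrative rather than our exact deployment logic.

```python
from collections import deque

def process_redaction_queue(model, requests, redact_fn, epochs=5):
    # requests: (point, true_label) redaction requests arriving irregularly over time.
    # Each request starts from the currently deployed weights, so no full retraining
    # from scratch is ever performed.
    queue = deque(requests)
    while queue:
        point, true_label = queue.popleft()
        model = redact_fn(model, point, true_label, epochs=epochs)
    return model
```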
2.1 Class Label Poisoning
In our approach, we intentionally poison the label of the point to be redacted during the ensuing retraining epochs. In our experiments, we randomly choose the poison label once and then use it in every epoch, as sketched below.
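The following is a minimal sketch of this step; how the poisoned point is mixed with other retraining data, and all hyperparameters shown, are assumptions made for illustration rather than our exact configuration.

```python
import random
import torch

def poison_and_retrain(model, redact_x, redact_y, retrain_loader, num_classes,
                       epochs=5, lr=1e-4):
    # Draw the incorrect (poison) label once; it is reused in every retraining epoch.
    poison_label = random.choice([c for c in range(num_classes) if c != redact_y])
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for xb, yb in retrain_loader:  # correctly labelled data keeps overall accuracy stable
            # Append the redaction point with its poisoned label to the batch.
            xb = torch.cat([xb, redact_x.unsqueeze(0)])
            yb = torch.cat([yb, torch.tensor([poison_label], dtype=yb.dtype)])
            optimizer.zero_grad()
            loss_fn(model(xb), yb).backward()
            optimizer.step()
    return model
```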
Intuitively, this mislabelling decreases the