TRUE and since this is a conjunction, $\phi_{ds} \wedge \pi_{\#}$
evaluates to FALSE. Similarly, if $m$ is 10, then $\pi_{\#}$ makes
the conjunction evaluate to FALSE. Thus, we say that
the data set does not hold the property $\pi_{\#}$.
Let us start examining more complex properties
that can be formally verified over the data set. A
slightly more complex property to verify is: “the data
set must be min-max normalized,” which can be expressed
in MSFOL as
$\pi_{\pm} \leftrightarrow \nexists (i, j \in \mathbb{Z})((i \geq 0) \wedge (i < n) \wedge (j \geq 0) \wedge (j < m) \wedge ((D[i][j] < \mathit{min}) \vee (D[i][j] > \mathit{max})))$.
Here $\mathit{min}$ and $\mathit{max}$ are defined constants
(e.g., −1 and 1), and either these variables must be defined
or their values must be substituted; for $\mathit{min} = -1$ and
$\mathit{max} = 1$, $\phi_{ds}$ holds the property $\pi_{\pm}$
(as $\phi_{ds} \wedge \pi_{\pm}$ is satisfiable).
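
To make the reading of the min-max property concrete, the following is a minimal sketch in plain Python rather than in the MSFOL encoding used in this paper. It searches for a witness index pair that violates the bounds; the property holds exactly when no witness exists. The function names and the two small data sets are hypothetical and serve only as an illustration.

```python
from typing import List, Optional, Tuple


def find_range_violation(
    D: List[List[float]], lo: float, hi: float
) -> Optional[Tuple[int, int]]:
    """Return the first (i, j) with D[i][j] < lo or D[i][j] > hi, else None."""
    for i, row in enumerate(D):
        for j, value in enumerate(row):
            if value < lo or value > hi:
                return (i, j)
    return None


def is_min_max_normalized(
    D: List[List[float]], lo: float = -1.0, hi: float = 1.0
) -> bool:
    # The property is a negated existential: it holds when no witness exists.
    return find_range_violation(D, lo, hi) is None


# Hypothetical data sets (not the paper's example): two rows, two columns.
D_ok = [[-1.0, 0.5], [0.25, 1.0]]
D_bad = [[-1.0, 0.5], [0.25, 1.5]]

print(is_min_max_normalized(D_ok))
print(find_range_violation(D_bad, -1.0, 1.0))
```

Returning the violating indices instead of a bare Boolean mirrors the usefulness of a counterexample when a formal property fails.
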
The previous properties are useful to showcase
how easy it is to translate desired properties into the
formalism. However, verifying such properties is
quite simple and can, furthermore, be uninteresting,
as the data set can be normalized later on, for example.
As previously stated, our motivation comes
from the proper extraction and collection of the data
set. We have discussed the case where training examples
are provided for some regions of the input
space while other regions are overlooked. To
verify that “the data set is sampled across the whole
input space,” the following property can be verified:
$\pi_{*} \leftrightarrow \nexists (p \in A_{\mathbb{Z},\mathbb{R}}) \forall (i \in \mathbb{Z}) ((i \geq 0) \wedge (i < m)) \implies (\sqrt{\sum_{j=0}^{m-1} (p[j] - D[i][j])^{2}} > \delta)$;
the property basically states that there does not exist a point
whose Euclidean distance to every training example is greater
than a chosen constant $\delta$. As an example, for $\delta = 1$,
our example data set does not hold the previous property $\pi_{*}$,
as there exists a point in the input space that has a greater
Euclidean distance, for example $p[0] = 2$ and $p[1] = 2$. Note
that the property never specifies the minimum or maximum
values of the input space and thus the property is unlikely
to hold, as no data set is sampled over an infinite domain.
An easy solution is to add such constraints to $\pi_{*}$, i.e.,
$\nexists (l \in \mathbb{Z})((l \geq 0) \wedge (l < n) \wedge ((p[l] > \mathit{max}) \vee (p[l] < \mathit{min})))$,
for given $\mathit{max}$ and $\mathit{min}$ constants. We draw
the reader’s attention to the fact that a formal specification
must be well-stated; this is an assumption
of our work and, generally, of any formal verification
strategy.
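
The coverage property with a bounded input space can be illustrated by the following sketch, which is only an approximation and not the SMT-based check of this paper. It enumerates candidate points on a regular grid inside the box [lo, hi] in every dimension and returns one that is farther than δ from every training example. The grid resolution, the function name and the data set are assumptions made for illustration; a grid search can find a witness but cannot prove that none exists, whereas a solver reasons over all real-valued points.

```python
import itertools
import math
from typing import List, Optional, Tuple


def find_uncovered_point(
    D: List[List[float]], delta: float, lo: float, hi: float, steps: int = 9
) -> Optional[Tuple[float, ...]]:
    """Return a grid point whose distance to every row of D exceeds delta."""
    n_features = len(D[0])
    # Evenly spaced candidate coordinates in [lo, hi] (assumed resolution).
    axis = [lo + k * (hi - lo) / (steps - 1) for k in range(steps)]
    for p in itertools.product(axis, repeat=n_features):
        if all(math.dist(p, row) > delta for row in D):
            return p  # witness: the coverage property does not hold
    return None  # no witness on this grid (not a proof of coverage)


# Hypothetical two-feature data set (not the paper's example).
D = [[0.0, 0.0], [1.0, 1.0]]
witness = find_uncovered_point(D, delta=1.0, lo=-2.0, hi=2.0)
print(witness is not None)
```

With these hypothetical values, the point (2, 2) is among the witnesses, since its distance to both training examples exceeds 1.
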
Finally, note that sometimes it is more convenient
to state negated properties. For example, to verify that
the data set is balanced, we can verify the following
property: “there is no class which has fewer than
$\frac{m}{\beta * l}$ samples,” where $l$ is the number of
different outputs (labels) and $\beta$ is a chosen constant.
This property states that the data set must have an equal
number of samples per class, up to a given constant. For
example, if $\beta = 1$ the data set must be perfectly balanced,
while if $\beta = 2$ only half of the samples (of a perfectly
balanced data set) are required per class. It is important
to state that unbalanced data sets represent a real
problem for current machine learning algorithms and,
moreover, are often encountered in the domain.
Accordingly, researchers actively try to tackle the problem
(see for example (Lemaître et al., 2017)). Indeed,
it may not be intuitive how to state this property
in first order logic. There are many particularities that
must be considered; for example, there is no notion of
loops in first order logic, and we need to define a function
that counts the number of instances
where a given label appears. To overcome this particular
problem a recursive function can be defined. In
order to keep the paper readable, we omit these definitions
and simply denote defined functions in mathematical
bold font. The interested reader can refer
to the prototype implementation section (Section 4)
and correspondingly to the tool’s repository to check
the full property implementation. We state the aforementioned
property as:
$\pi_{\equiv} \leftrightarrow \nexists i \in \mathbb{Z}((i \geq 0) \wedge (i < l) \wedge (\mathbf{S}(O, L[i], m) < \frac{m}{\beta * l}))$,
where $\mathbf{S}(A, v, s)$ is a function that returns the number
of times the value $v$ is found in an array $A$ up to index $s$;
that is, how many times the label is found in the label array.
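
The recursive counting idea can be sketched in plain Python as follows. This is not the definition used in the tool's repository, which is omitted here; the function names are hypothetical, and the sketch assumes that “up to index s” means the first s elements of the array, i.e., indices 0 to s − 1.

```python
from typing import List, Sequence


def count_up_to(A: Sequence[int], v: int, s: int) -> int:
    """Recursive count of v among A[0], ..., A[s-1] (assumed reading of S)."""
    if s <= 0:
        return 0
    return (1 if A[s - 1] == v else 0) + count_up_to(A, v, s - 1)


def is_balanced(O: Sequence[int], L: Sequence[int], beta: float) -> bool:
    """True when no label in L occurs fewer than m / (beta * l) times in O."""
    m = len(O)
    threshold = m / (beta * len(L))
    # Negated existential: the property fails if some label is under-represented.
    return not any(count_up_to(O, label, m) < threshold for label in L)


# Hypothetical label array with four samples and two labels.
O = [0, 0, 0, 1]
L = [0, 1]

print(is_balanced(O, L, beta=1))
print(is_balanced(O, L, beta=2))
```

In this hypothetical case the threshold is 2 samples per class for β = 1 and 1 sample per class for β = 2, so only the second check succeeds. Python's recursion limit makes this direct recursion unsuitable for large arrays; it is kept here only to mirror the loop-free formulation.
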
We have exemplified different properties that can
be formally verified in data sets. We do not focus
on an extensive list of properties but rather on
providing means for formally verifying any property in
a given data set. We could state many more properties;
for example, there are no contradicting training
examples in the data set, i.e., there do not exist two
equal elements in D with different indices for which
the corresponding elements in O differ. We limit this
section to these examples. However, we note that,
as shown in the previous examples, the formalism is
quite flexible for expressing real properties of interest.
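
The “no contradicting training examples” property mentioned above can be illustrated with a short plain-Python sketch, again outside the MSFOL encoding. The function name and data are hypothetical; the check returns a pair of indices whose rows in D are equal while their entries in O differ, or None when no such pair exists.

```python
from typing import Dict, List, Optional, Sequence, Tuple


def find_contradiction(
    D: List[List[float]], O: Sequence[int]
) -> Optional[Tuple[int, int]]:
    """Return (i, k), i < k, with D[i] == D[k] and O[i] != O[k], else None."""
    first_seen: Dict[Tuple[float, ...], int] = {}
    for k, (row, label) in enumerate(zip(D, O)):
        key = tuple(row)
        if key in first_seen and O[first_seen[key]] != label:
            return (first_seen[key], k)
        # Remember only the first index at which this input was observed.
        first_seen.setdefault(key, k)
    return None


# Hypothetical data: rows 0 and 2 are equal but carry different labels.
D = [[0.0, 1.0], [1.0, 1.0], [0.0, 1.0]]
O_conflict = [0, 1, 1]
O_consistent = [0, 1, 0]

print(find_contradiction(D, O_conflict))
print(find_contradiction(D, O_consistent))
```

Comparing each row only against the first occurrence of the same input suffices, because any group of equal inputs with differing labels contains a label that differs from the first one.
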
Discussion – On Standard and Domain-specific
Properties. We have showcased the flexibility of
the proposed approach with somewhat standard properties
to check. One can imagine more of these properties;
for example, guaranteeing that there are no outlier
training examples$^{3}$ in the data set can be logically
expressed as finding points in the space with
high variance. Nonetheless, it is interesting to point
out that the approach is generic and that domain-specific
expert knowledge can also be
used to formulate properties. For example, consider
a real estate data set, where two categorical features,
isHouse and isApt, cannot both be TRUE at the same
$^{3}$Training examples which have extreme values, far from
the rest of the data points.
Toward Formal Data Set Verification for Building Effective Machine Learning Models