for $d = 1, 2, \ldots, D$. The degree to which the data match the distribution $f_d$ is quantified by the log-likelihood:
$$z^{\mathrm{lik}}_{i,d} = \log L_{f_d}(x_i) = \log \prod_{j=1}^{n_i} f_d(x_{i,j}) \tag{4}$$
This method generalizes the binning approach if we
consider the bins as uniform distributions.
The question is how to define the set of characteristic distributions $f_d$ for $d = 1, 2, \ldots, D$. We propose the following approach, which generates a rich space of them: consider four normal distributions for each $i = 1, \ldots, n$:
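• $N\left(\hat{\mu}(x_i),\ \hat{\sigma}(x_i)\right)$
• $N\left(\hat{\mu}(x_i),\ \frac{\hat{\sigma}(x_i)}{2}\right)$
• $N\left(\hat{\mu}(x_i) - \frac{\hat{\sigma}(x_i)}{2},\ \frac{\hat{\sigma}(x_i)}{2}\right)$
• $N\left(\hat{\mu}(x_i) + \frac{\hat{\sigma}(x_i)}{2},\ \frac{\hat{\sigma}(x_i)}{2}\right)$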
Thus, we generate an abundance of D = 4 · m dis-
tributions, which requires a robust regularization ap-
proach.
The motivation for this choice of distributions is to capture the mutual matches between records $i$ and $j$ for $i, j \in \{1, \ldots, n_i\}$ and whether record $i$ has values below or above those of record $j$.
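The log-likelihood vectorization of Eq. (4) with the four normal distributions per record can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in our experiments: the second parameter of $N(\cdot,\cdot)$ is read as a standard deviation, $\hat{\sigma}$ is taken as the sample standard deviation, and the function names and the toy data are hypothetical.

```python
import math
from statistics import mean, stdev


def characteristic_distributions(sample):
    """Return four (mu, sigma) pairs derived from one record's sample.

    Assumption: sigma is a standard deviation, estimated by the sample
    standard deviation of the record.
    """
    mu, sigma = mean(sample), stdev(sample)
    half = sigma / 2
    return [(mu, sigma), (mu, half), (mu - half, half), (mu + half, half)]


def normal_log_pdf(x, mu, sigma):
    """Log-density of N(mu, sigma) at x, computed directly in log space."""
    z = (x - mu) / sigma
    return -0.5 * math.log(2 * math.pi) - math.log(sigma) - 0.5 * z * z


def log_likelihood(sample, mu, sigma):
    """Eq. (4): the log of a product equals the sum of log-densities."""
    return sum(normal_log_pdf(x, mu, sigma) for x in sample)


def vectorize(sample, distributions):
    """Map one record to one log-likelihood feature per distribution."""
    return [log_likelihood(sample, mu, sigma) for mu, sigma in distributions]


# Hypothetical toy records with differing numbers of observations.
train = [[30.0, 35.0, 40.0, 45.0], [60.0, 62.0, 70.0, 75.0, 80.0]]

# Four characteristic distributions per record.
distributions = [d for s in train for d in characteristic_distributions(s)]

# Every record becomes a fixed-length vector regardless of its sample size.
features = [vectorize(s, distributions) for s in train]
```

Summing log-densities instead of taking the logarithm of the product avoids numerical underflow when a record contains many observations.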
3.2 Note on Comparison
When using the introduced vectorizations in machine
learning tasks, we considered two approaches:
• Approach 1: Combine the vectorization with a min-max scaler and a simple model with robust regularization. For example, logistic regression can be applied with cross-validation to select the regularization parameter (Golub et al., 1979). Similarly, we can use Lasso for regression. The essential advantage of this approach is the interpretability of the coefficients. The robust regularization makes it applicable to all vectorization methods, even if they differ significantly in the number of features.
• Approach 2: Use an AutoML library that can handle nonlinearity as well as interactions between features. We consider this the only viable option for the comparison, because the vectorizations differ in their numbers of features. We adopted TPOT (Le et al., 2020).
To obtain a statistically sound comparison of the various vectorizations, we adopt the 5x2 CV test (Alpaydın, 1999), which is widely used for comparing machine learning methods in general.
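As a minimal sketch, assuming the test meant here is the combined 5x2 cv F test of the cited work, the statistic can be computed from the differences in error rates of two methods over five replications of 2-fold cross-validation. The function name and the example differences are hypothetical.

```python
def combined_5x2cv_f(differences):
    """Combined 5x2 cv F statistic.

    differences: five (p1, p2) pairs, one per replication, where p1 and p2
    are the differences in error rate between the two compared methods on
    the two folds of that replication.
    """
    if len(differences) != 5 or any(len(pair) != 2 for pair in differences):
        raise ValueError("expected five pairs of fold differences")

    # Numerator: sum of all ten squared differences.
    numerator = sum(p * p for pair in differences for p in pair)

    # Denominator: twice the sum of the per-replication variance estimates.
    variance_sum = 0.0
    for p1, p2 in differences:
        mean_p = (p1 + p2) / 2
        variance_sum += (p1 - mean_p) ** 2 + (p2 - mean_p) ** 2
    if variance_sum == 0:
        raise ValueError("all replications have zero variance")

    return numerator / (2 * variance_sum)


# Hypothetical error-rate differences between two vectorizations.
example = [(0.02, 0.04), (0.03, 0.01), (0.05, 0.02), (0.01, 0.03), (0.04, 0.02)]
f_statistic = combined_5x2cv_f(example)
```

Under the null hypothesis of equal performance, the statistic is approximately F-distributed with 10 and 5 degrees of freedom, so a value above roughly 4.74 would indicate a difference at the 0.05 level.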
4 CASE STUDY: IMAGE MATCHING
4.1 Case Study Statement
Our selected classification problem is motivated by a document-processing pipeline, which requires operators to check whether a pair of scans corresponds to the same underlying physical document. In this document-processing pipeline¹, physical documents are scanned twice:
• once using a mobile phone scanning application
• and a second time on standard office scanners.
We call these mobile scans and standard scans, respectively. The two scans of one document are therefore near-duplicates but not pixel-perfect copies. Minor differences arise due to lighting, angle, cropping, and differing devices. An example of a matching image pair is shown in Figure 2a, and a non-matching pair in Figure 2b. The task is to determine whether a given pair consisting of a mobile scan and a standard scan shows the same underlying physical document, i.e., a classification task with a binary target $y_i$.
More formally, given two images $s(d_a)$ and $s'(d_b)$, where $s(d_a)$ is a mobile scan $s$ of document $d_a$, and $s'(d_b)$ is a standard scan $s'$ of document $d_b$, determine if $a = b$:
$$y_i = \begin{cases} 1, & \text{if } a = b \\ 0, & \text{otherwise} \end{cases} \tag{5}$$
Features are extracted using the ORB algorithm (Rublee et al., 2011). The ORB algorithm identifies keypoints in the image, and each keypoint has a corresponding feature vector, also known as a descriptor. Keypoints from the two images are then matched by pairing those with the lowest calculated distance between their respective descriptors.
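The matching step can be illustrated with a short sketch. It assumes that descriptors are binary strings compared by Hamming distance, which is the usual choice for ORB, and it uses a plain nearest-neighbour search. The text above does not state the distance or the matcher, and the 8-bit descriptors below are hypothetical stand-ins for real ORB descriptors.

```python
def hamming(a, b):
    """Number of differing bits between two binary descriptors (ints)."""
    return bin(a ^ b).count("1")


def match_descriptors(desc_a, desc_b):
    """Pair each descriptor in desc_a with its nearest descriptor in desc_b.

    Returns (index_a, index_b, distance) tuples sorted by distance, so the
    best matches come first.
    """
    matches = []
    for i, a in enumerate(desc_a):
        j, dist = min(
            ((j, hamming(a, b)) for j, b in enumerate(desc_b)),
            key=lambda pair: pair[1],
        )
        matches.append((i, j, dist))
    return sorted(matches, key=lambda m: m[2])


# Hypothetical descriptors for a mobile scan and a standard scan.
mobile = [0b10110010, 0b01001101, 0b11110000]
standard = [0b10110011, 0b00001111, 0b01001100]

matches = match_descriptors(mobile, standard)

# The match distances form the sample of observations for this image pair.
distances = [dist for _, _, dist in matches]
```

The resulting list of distances has a length that depends on the number of matches found, which is why a fixed-length vectorization is needed before a standard classifier can be applied.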
Figure 2 displays keypoints and their corresponding matches for matching and non-matching image pairs. The top 20 matches are shown. Notice that in Figure 2a, keypoints are matched well but not perfectly, while in Figure 2b, understandably, they cannot be matched well. A tendency, though not a sharp separation, is also evident in Figure 1, where we compare two histograms: one for a case where the scans come from the same document and one where they do not.
Every identified match thus results in a distance based on the quality of the match. The number of identified matches $n_i$ in each image pair may vary, resulting in a set of observed distances $x_i$. Therefore,
¹ More details and business context are described in (Čapek, 2022).
Probability Distribution as an Input to Machine Learning Tasks