scientific datasets (Greiner-Petter et al., 2020). For
math image recognition, there are conversions such
as image to LaTeX (Peng et al., 2021; Wang and
Liu, 2016) and image to markup (Deng et al., 2017).
Other work summarizes content by generating
headlines with math equations (Yuan et al., 2020).
However, none of these approaches targets the web,
and their datasets are likewise specialized. For the
web, “Approach Zero” (Zhong, 2022) is a formula
search available from a browser that lets users search
for formulae in specific databases. For PDFs, there
is research on analyzing PDFs using OCR software
and presenting math expression images in response
to a query (Yamada and Murakami, 2020) and re-
search on detecting math formula regions as bounding
boxes around formulae in PDFs using a CNN (Dey
and Zanibbi, 2021). Our research aims to extract
concise math expression images from the images in
HTML documents on the web by binary classification,
without directly analyzing the image contents. To
our knowledge, no similar study exists.
3 EXPERIMENTS
After preparing a dataset, we applied preprocessing,
including the elimination of duplicate images, and
determined the conditions for a concise math
expression. We then checked the correct images of the
dataset against these conditions and conducted two
experiments to evaluate the performance of the
resulting classifiers. Experiment 1 used machine
learning methods other than deep learning;
Experiment 2 used CNNs. Finally, we compared all of
the classifiers and selected the best one.
3.1 Dataset
Table 1: Dataset. These raw data include duplicate images
and errors in preprocessing. “Other than html” includes
PDFs, slides, Google Books, and so on.

Dataset                 Image               Acquired webpage breakdown
                        Total    Correct    Other than html   Error   Html
$D_{0\,\mathrm{trn}}$   19,470     442      1,091                80   1,829
$D_{0\,\mathrm{val}}$   15,427     351      1,269                60   1,671
$D_{0\,\mathrm{tst}}$   23,988     929      1,662                72   2,266
Total                   58,885   1,722      4,022               212   5,766
We use the same dataset as in prior work (Yamada et
al., 2018). We randomly selected 100 keywords from
the index of Bishop’s “Pattern Recognition and
Machine Learning” (Bishop, 2006) and performed a web
search with each keyword as the query to obtain its
top 100 web pages. We created the dataset by
extracting all the images from those pages. In
Table 1, $D_{0\,\mathrm{trn}}$ is the training dataset
(keywords 31 to 60), $D_{0\,\mathrm{val}}$ is the
validation dataset (keywords 1 to 30), and
$D_{0\,\mathrm{tst}}$ is the testing dataset (keywords
61 to 100). The first author manually
judged images to determine whether they were related
to the keywords. When unclear cases surfaced, judg-
ments were made in consultation with another person
(the same person throughout all judgments). Keyword
examples are softmax function, SVM, kernel density
estimation method, Heaviside step function, Gaus-
sian kernel, convex function, Probit function, Boltz-
mann distribution, functional derivative, and least-
mean-squares algorithm.
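The keyword-based split can be checked against Table 1 with a short sketch. It assumes that keywords are numbered 1 to 100 and that each keyword contributes exactly 100 acquired web pages (HTML, non-HTML, and errors combined). The validation HTML count is read as 1,671, the value consistent with the column total of 5,766. The names `SPLITS`, `PAGES`, `split_of`, and `pages_match` are illustrative and do not come from the original work.

```python
# Keyword ranges per split (inclusive), as described in the text.
SPLITS = {
    "trn": range(31, 61),   # keywords 31-60
    "val": range(1, 31),    # keywords 1-30
    "tst": range(61, 101),  # keywords 61-100
}

# Acquired webpage breakdown from Table 1: (other than html, error, html).
PAGES = {
    "trn": (1091, 80, 1829),
    "val": (1269, 60, 1671),
    "tst": (1662, 72, 2266),
}

PAGES_PER_KEYWORD = 100  # assumption: top 100 pages per keyword query


def split_of(keyword_id: int) -> str:
    """Return the split name that a 1-based keyword number belongs to."""
    for name, ids in SPLITS.items():
        if keyword_id in ids:
            return name
    raise ValueError(f"keyword id out of range: {keyword_id}")


def pages_match(name: str) -> bool:
    """Check that a split's page counts equal 100 pages per keyword."""
    return sum(PAGES[name]) == PAGES_PER_KEYWORD * len(SPLITS[name])


if __name__ == "__main__":
    for name in SPLITS:
        print(name, len(SPLITS[name]), sum(PAGES[name]), pages_match(name))
```

Under these assumptions, the three breakdown columns of each row sum to 100 times the number of keywords in that split, which is a quick way to sanity-check the table.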
3.2 Preprocessing
Because of the way $D_{0\,\mathrm{trn}}$ and
$D_{0\,\mathrm{val}}$ were created, they contained
identical images registered under different IDs. We
therefore deleted the images whose features
overlapped. In Experiment 1, the basic features (file
size, width, and height) were used, so images with
identical values for these features were deleted. In
Experiment 2, the images themselves were used, so
images with the same features and the same appearance
were deleted. In addition, unnecessary icons such as
buttons and logos were removed from the dataset for
Experiment 1: we extracted the common strings from
the image names of the unnecessary icons in
$D_{0\,\mathrm{trn}}$ and deleted, in advance, the
images whose names contained these strings.
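The preprocessing for Experiment 1 can be sketched as follows. This is a minimal illustration, not the authors' code: the `ImageRecord` type, the `ICON_SUBSTRINGS` list, and the function names are hypothetical, and the sketch assumes that the first image seen with a given (file size, width, height) triple is the one kept.

```python
from dataclasses import dataclass
from typing import Iterable, List, Sequence, Set, Tuple


@dataclass(frozen=True)
class ImageRecord:
    """Hypothetical record for one image collected from a web page."""
    image_id: int
    name: str
    file_size: int  # bytes
    width: int      # pixels
    height: int     # pixels


# Hypothetical substrings; the paper derives its own list from the training data.
ICON_SUBSTRINGS = ("icon", "logo", "button")


def drop_icons(images: Iterable[ImageRecord],
               substrings: Sequence[str] = ICON_SUBSTRINGS) -> List[ImageRecord]:
    """Remove images whose file name contains any icon-like substring."""
    return [img for img in images
            if not any(s in img.name.lower() for s in substrings)]


def drop_duplicates(images: Iterable[ImageRecord]) -> List[ImageRecord]:
    """Keep the first image for each (file size, width, height) triple."""
    seen: Set[Tuple[int, int, int]] = set()
    kept: List[ImageRecord] = []
    for img in images:
        key = (img.file_size, img.width, img.height)
        if key not in seen:
            seen.add(key)
            kept.append(img)
    return kept


def preprocess(images: Iterable[ImageRecord]) -> List[ImageRecord]:
    """Filter icons by name first, then remove feature-level duplicates."""
    return drop_duplicates(drop_icons(images))
```

Matching on three basic features alone can also discard distinct images that happen to share a file size and dimensions. The text describes an additional appearance check for Experiment 2, which this sketch does not attempt.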
3.3 Determining Concise Math Expression Conditions
After preprocessing, 314 of the original 442
keyword-related correct images in
$D_{0\,\mathrm{trn}}$ (Table 1) remained, and we
analyzed them to identify the conditions for a concise
math expression. Because the images were found by web
searches using the above keywords, many of the correct
math expressions have proper names, such as the
Gaussian kernel. They are therefore written in an
organized form and can be interpreted from the
expressions themselves, which makes these images
suitable candidates for concise math expression
images. By directly viewing the images, we examined
the following: “number of horizontal characters
(including symbols),” “number of vertical characters
(including symbols),” “number of lines,” “number of
expressions¹,” and “number of concatenations” (=, <,
and so on). Because the fonts used in web math ex-
¹ The number of expressions inside parentheses () is counted as 1. Nested expressions within an expression are not counted.
ICAART 2023 - 15th International Conference on Agents and Artificial Intelligence