In this paper, we present a novel approach to explor-
ing collections of semi-structured data by extending
the cell-centric indexing approach. We emphasize the
importance of a user-friendly interface in achieving
the overarching objectives of data exploration and
retrieval. Our solution takes just over an
hour to index 40 GB of semi-structured data. Over
99% of the queries are executed in under one second.
Our work not only enables researchers to access and
retrieve data without prior knowledge of its
organization but also demonstrates the applicability
of this approach in materials science.
Our short-term plans include incorporating our
system into the experimental pipeline of a small num-
ber of materials scientists. We will collect feedback
via surveys and think-aloud studies, adjust the sys-
tem as necessary, and expand the user group. A
key step is explicitly incorporating images, espe-
cially microscopy. Recently, Nguyen et al. (2021)
demonstrated the capabilities of a symmetry-aware
neural network featurization in exploring large un-
structured databases of microscopy images. We in-
tend to explore ways to use image embeddings as
additional criteria for refining searches, while main-
taining the performance we achieved by using Elas-
ticsearch as our backend indexing system. By in-
tegrating database management, ontology develop-
ment, and machine learning, we aim to enable effi-
cient metadata searches and facilitate the comparison
of physics-aware features within microscopy images.
This initiative holds the promise of accelerating the
exploration of synthesis-structure-property relation-
ships to advance materials design. Although our em-
phasis has been on scientific data management, these
approaches are general enough to be applied to any
enterprise that has a data lake of semi-structured data.
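As a rough illustration of how image embeddings might serve as an additional search criterion, the sketch below re-ranks a set of metadata-filtered hits by cosine similarity to a query embedding. It is only a minimal sketch under stated assumptions: the hit structure, the `refine_by_embedding` function, and the similarity threshold are hypothetical and are not part of our system, and the embeddings are toy two-dimensional vectors rather than the output of any featurization network.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def refine_by_embedding(hits, query_embedding, min_similarity=0.0):
    """Hypothetical refinement step: keep hits whose image embedding is close
    enough to the query embedding, ordered from most to least similar."""
    scored = []
    for hit in hits:
        sim = cosine_similarity(hit["embedding"], query_embedding)
        if sim >= min_similarity:
            scored.append((sim, hit["id"]))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored]


# Toy hits standing in for results already narrowed by a metadata query.
hits = [
    {"id": "img-a", "embedding": [1.0, 0.0]},
    {"id": "img-b", "embedding": [0.0, 1.0]},
    {"id": "img-c", "embedding": [0.7, 0.7]},
]
ranked = refine_by_embedding(hits, [1.0, 0.1], min_similarity=0.5)
print(ranked)
```

In this sketch the metadata query runs first and the embedding comparison is applied only to the returned hits, which keeps the cost of the similarity step proportional to the result set rather than to the whole collection.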
This material is based upon work supported
by the National Science Foundation under Grant
No. 2246463.
Data Discovery and Indexing for Semi-Structured Scientific Data