7 CONCLUSION
In this paper, we present a novel approach to explor-
ing collections of semi-structured data by extending
the cell-centric indexing approach. We emphasize the
importance of a user-friendly interface in achieving
the overarching objective of data exploration and retrieval
tasks. Our solution takes just over an
hour to index 40 GB of semi-structured data. Over
99% of the queries are executed in under one second.
Our work not only empowers researchers to effortlessly
access and retrieve data without prior knowledge
of its organization but also demonstrates the
applicability of the approach in materials science.
Our short-term plans include incorporating our
system into the experimental pipeline of a small num-
ber of materials scientists. We will collect feedback
via surveys and think-aloud studies, adjust the sys-
tem as necessary, and expand the user group. A
key step is explicitly incorporating images, espe-
cially microscopy. Recently, Nguyen et al. (2021)
demonstrated the capabilities of a symmetry-aware
neural network featurization in exploring large un-
structured databases of microscopy images. We in-
tend to explore ways to use image embeddings as
additional criteria for refining searches, while main-
taining the performance we achieved by using Elas-
ticsearch as our backend indexing system. By in-
tegrating database management, ontology develop-
ment, and machine learning, we aim to enable effi-
cient metadata searches and facilitate the comparison
of physics-aware features within microscopy images.
This initiative holds the promise of accelerating the
exploration of synthesis-structure-property relation-
ships to advance materials design. Although our em-
phasis has been on scientific data management, these
approaches are general enough to be applied to any
enterprise that has a data lake of semi-structured data.
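As a rough illustration of the planned refinement by image embeddings, the sketch below builds a search request body that ranks documents by embedding similarity while restricting them to a metadata filter. It is a minimal sketch under stated assumptions, not our implementation: the function name `build_refined_search`, the field names `image_embedding` and `instrument`, and the example vector are all hypothetical, and the body follows the top-level `knn` search option available in recent Elasticsearch (8.x) releases, which may differ from the version used in a given deployment.

```python
from typing import Any


def build_refined_search(
    metadata_filters: dict[str, str],
    image_embedding: list[float],
    k: int = 10,
    num_candidates: int = 100,
) -> dict[str, Any]:
    """Build a search body that ranks by embedding similarity within a metadata filter."""
    if num_candidates < k:
        raise ValueError("num_candidates must be at least k")
    # Exact-match term filters on (hypothetical) metadata fields.
    filters = [{"term": {field: value}} for field, value in metadata_filters.items()]
    return {
        "knn": {
            "field": "image_embedding",  # hypothetical dense_vector field name
            "query_vector": image_embedding,
            "k": k,
            "num_candidates": num_candidates,
            # Only documents passing the metadata filter are considered.
            "filter": {"bool": {"filter": filters}},
        },
        "size": k,
    }


# Example: the five images most similar to a query embedding, restricted to
# records whose (hypothetical) "instrument" field equals "TEM".
body = build_refined_search({"instrument": "TEM"}, [0.12, -0.48, 0.33], k=5)
```

The resulting dictionary would be sent as the body of a `_search` request. Applying the metadata filter inside the nearest-neighbor clause, rather than after it, is one way to keep the result count at `k` even when the filter is selective.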
ACKNOWLEDGEMENTS
This material is based upon work supported
by the National Science Foundation under Grant
No. 2246463.
REFERENCES
Chapman, A., Simperl, E., Koesten, L., Konstantinidis,
G., Ibáñez, L.-D., Kacprzak, E., and Groth, P. (2020).
Dataset search: a survey. The VLDB Journal, 29(1):251–
272.
Dunaiski, M., Greene, G. J., and Fischer, B. (2017). Ex-
ploratory search of academic publication and citation
data using interactive tag cloud visualizations. Sciento-
metrics, 110(3):1539–1571.
Fan, J., Keim, D., Gao, Y., Luo, H., and Li, Z. (2009).
Justclick: Personalized image recommendation via ex-
ploratory search from large-scale flickr images. Circuits
and Systems for Video Tech., IEEE Trans., 19:273–288.
He, D., Brusilovsky, P., Ahn, J., Grady, J., Farzan, R.,
Peng, Y., Yang, Y., and Rogati, M. (2008). An evalua-
tion of adaptive filtering in the context of realistic task-
based information exploration. Inf. Process. Manage.,
44(2):511–533.
Heflin, J., Davison, B. D., and Jia, H. (2021). Exploring
datasets via cell-centric indexing. In DESIRES 2021,
CEUR Workshop Proceedings, volume 2950.
Maier, D., Megler, V. M., and Tufte, K. (2014). Challenges
for dataset search. In Int’l. Conf. on Database Systems
for Advanced Applications, pages 1–15. Springer.
McCusker, J. P., Keshan, N., Rashid, S. M., Deagen,
M., Brinson, L. C., and McGuinness, D. L. (2020).
Nanomine: A knowledge graph for nanocomposite ma-
terials science. In 19th Int’l Semantic Web Conference,
volume 12507 of LNCS, pages 144–159. Springer.
Nguyen, T. N. M., Guo, Y., Qin, S., Frew, K. S., Xu, R.,
and Agar, J. C. (2021). Symmetry-aware recursive im-
age similarity exploration for materials microscopy. npj
computational materials, 7(1):1–14.
Soto, A. J., Kiros, R., Kešelj, V., and Milios, E. (2015).
Exploratory visual analysis and interactive pattern
extraction from semi-structured data. ACM Trans. Interact.
Intell. Syst., 5(3).
Stansberry, D., Somnath, S., Breet, J., Shutt, G., and
Shankar, M. (2019). DataFed: Towards reproducible
research via federated data management. In 2019 Int’l
Conf. on Comp. Science and Comp. Intelligence (CSCI),
pages 1312–1317.
Wang, M., Ma, H., Daundkar, A., Guan, S., Bian, Y., Se-
hirlioglu, A., and Wu, Y. (2022). CRUX: crowdsourced
materials science resource and workflow exploration. In
Proc. of the 31st ACM Int’l Conf. on Info. & Knowledge
Mgmt., pages 5014–5018. ACM.
White, R. W. and Roth, R. A. (2009). Exploratory Search:
Beyond the Query-Response Paradigm. Synthesis Lec-
tures on Information Concepts, Retrieval, and Services.
Morgan & Claypool Publishers.
Wongsuphasawat, K., Moritz, D., Anand, A., Mackinlay, J.,
Howe, B., and Heer, J. (2016). Voyager: Exploratory
analysis via faceted browsing of visualization recom-
mendations. IEEE Transactions on Visualization and
Computer Graphics, 22(1):649–658.
Zhang, X., Song, D., Priya, S., and Heflin, J. (2013). In-
frastructure for efficient exploration of large scale linked
data via contextual tag clouds. In International Semantic
Web Conference, pages 687–702. Springer.
Data Discovery and Indexing for Semi-Structured Scientific Data