For data that does not match any standard distribution,
the DataFITR tool can generate an arbitrary univariate
distribution that matches the histogram using the Kernel
Density Estimation (KDE) approach (Parzen, 1962).
Most importantly, once a desired distribution is found,
the tool automatically produces Python code for random
variate generation corresponding to the selected
distribution. This code can be copied directly into a
stochastic simulation model to generate the random
variates. DataFITR is written in Python, with a GUI
front end built using the Streamlit library
(Streamlit, 2022). The tool is currently hosted
on the Streamlit Community Cloud. DataFITR currently
supports time-independent models (using a large
set of standard distributions or an arbitrary distribution)
and, for correlated data, multivariate Gaussian
distributions. We plan to add support for time-dependent
models and arbitrary multivariate distributions in
future versions.
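As a rough illustration of the KDE idea, the following minimal sketch fits a
Gaussian kernel density estimate to observed data and draws random variates
from it. It is not the code that DataFITR generates: it assumes SciPy's
`gaussian_kde` as the estimator, and the function names `fit_kde` and
`draw_variates` and the synthetic bimodal data are hypothetical.

```python
import numpy as np
from scipy.stats import gaussian_kde


def fit_kde(samples):
    """Fit a Gaussian kernel density estimate to 1-D samples."""
    return gaussian_kde(np.asarray(samples, dtype=float))


def draw_variates(kde, n, seed=None):
    """Draw n random variates from the fitted density."""
    # resample returns an array of shape (d, n); d == 1 for univariate data.
    return kde.resample(size=n, seed=seed)[0]


# Synthetic bimodal data (an assumption for illustration) that no single
# standard distribution would describe well.
rng = np.random.default_rng(0)
observed = np.concatenate(
    [rng.normal(2.0, 0.5, 500), rng.normal(6.0, 1.0, 500)]
)

kde = fit_kde(observed)
variates = draw_variates(kde, 1000, seed=1)
```

In a simulation model, a call such as `draw_variates(kde, 1)` would take the
place of a call to a standard distribution's sampler.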
The rest of this paper is organised as follows. In
Section 2, we provide a brief overview of open libraries
and existing tools available for building digital
twins, with a focus on IM. In Section 3, we describe
the features, usage flow and details of the DataFITR
tool. In Section 4, we present a simulation case study
of a bottling plant that highlights the utility of
the tool. In this case study, we generate data
using a known reference model (a discrete-event
simulation model) of a bottling plant, and use this data
and some knowledge about the real system to create
a matched model automatically using the DataFITR
tool. We present results showing how closely the
matched model agrees with the original reference model
in terms of the system parameters and
output/performance measures. Finally, we present
conclusions and future plans in the last section.
2 RELATED WORK
A broad survey of tools and processes for creating digital
twins is presented in (Fuller et al., 2020). Input
Modeling (IM) is a critical step, and historically
IM techniques have focused on offline system models
(Cheng, 2017). An overview of IM techniques
for various problem domains is given in (Nelson
and Yamnitsky, 1998). Commercial software tools
such as ExpertFit (Law, 2020) and Stat::Fit (Software,
2022) support IM by identifying probability
distributions that fit observed data. These tools also
help the user select a distribution from system
knowledge when data is unavailable, for aspects
such as task times and equipment failures. XLSTAT
(Lumivero, 2022) is a commercial Excel-based
tool that can be used for IM. Aside from commercial
tools, a few libraries in popular programming
languages such as Python and R exist for fitting
probability distributions to data. Distfit (Taskesen, 2020)
and fitter (Cokelaer, 2020) are two examples of open
Python-based libraries. Both can be used to fit
standard univariate distributions. fitteR (Boenn, 2022) is
an R-based library for fitting distributions to
empirical data. Distribution fitter (Distributionfitter, 2022)
is a Python-based GUI application built using
the fitter package. It can be used to fit univariate
distributions. Distribution Analyser
(DistributionAnalyser, 2022) is another Python-based application that
helps users analyze univariate distributions. It also
allows users to fit univariate distributions to
the data. While these are libraries that can be used via
interface routines, the DataFITR tool described in this
paper is a GUI-based tool that does not require any
programming for its use. Table 1 summarizes the
different features and scope of these libraries along with
the DataFITR tool proposed in this paper.
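The core operation shared by these fitting libraries, trying several standard
distributions and ranking them by goodness of fit, can be sketched as follows.
This is a simplified illustration using SciPy directly and does not reproduce
the API of any tool named above. The candidate list, the function name
`rank_fits` and the synthetic data are assumptions. Note also that
Kolmogorov-Smirnov p-values are optimistic when the parameters are estimated
from the same data, so only the statistic is used for ranking here.

```python
import numpy as np
from scipy import stats

# Illustrative candidate set; real tools search a much larger family.
CANDIDATES = ("norm", "expon", "uniform")


def rank_fits(samples, candidates=CANDIDATES):
    """Fit each candidate by maximum likelihood and rank by K-S statistic."""
    results = []
    for name in candidates:
        dist = getattr(stats, name)
        params = dist.fit(samples)  # MLE estimates, ending with (loc, scale)
        ks = stats.kstest(samples, name, args=params).statistic
        results.append((name, params, ks))
    # A smaller K-S statistic indicates a closer fit to the empirical CDF.
    return sorted(results, key=lambda r: r[2])


# Synthetic exponential data (an assumption for illustration).
rng = np.random.default_rng(42)
data = rng.exponential(scale=3.0, size=2000)

ranking = rank_fits(data)
best_name, best_params, best_ks = ranking[0]
```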
3 DataFITR: FEATURES AND
USAGE
DataFITR is open-source, released as a public repository
on GitHub (Lekshmi P, 2023). It is currently
hosted on the Streamlit Community Cloud and is freely
accessible via a browser at https://datafitr.streamlit.app. The
user can upload the data in CSV (comma-separated
values) format, where each column corresponds to a
single measured quantity and the first row is assumed
to contain the names of the quantities. Categorical
data is assumed to be integer-valued.
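The input convention above can be illustrated with a small sketch that reads
such a file and labels each column as integer-valued (categorical) or
real-valued (continuous). This is not DataFITR's own implementation: it
assumes pandas and purely numeric columns, and the column names
`service_time` and `num_defects` and the helper `classify_columns` are
hypothetical.

```python
import io

import pandas as pd


def classify_columns(df):
    """Label each numeric column 'categorical' or 'continuous'."""
    kinds = {}
    for name in df.columns:
        col = df[name].dropna()
        # Integer dtype, or floats with no fractional part, count as
        # integer-valued and are therefore treated as categorical.
        is_integer_valued = pd.api.types.is_integer_dtype(col) or bool(
            (col % 1 == 0).all()
        )
        kinds[name] = "categorical" if is_integer_valued else "continuous"
    return kinds


# Hypothetical file contents: the first row holds the quantity names,
# and each column is one measured quantity.
csv_text = """service_time,num_defects
1.92,0
2.41,2
1.75,1
3.08,0
"""

df = pd.read_csv(io.StringIO(csv_text))
kinds = classify_columns(df)
```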
DataFITR currently supports modeling time-independent,
independent and identically distributed
(IID) data where each measured quantity (column in
the data file) is independent. It also supports
modeling the case where some of the columns are
correlated. For the multivariate case, the tool currently
supports only Gaussian distributions; the ability to
fit arbitrary multivariate distributions is planned for
a future version. The side panel in the tool can
be used to select one of these cases for the IM
flow. To fit time-independent IID data, the user
uploads the file on the page corresponding to
time-independent data. Once the user uploads a file,
the tool automatically identifies the column headings,
determines whether each column corresponds to real-valued
(continuous) or integer-valued (categorical) data, and
creates histograms showing the marginal distribution of
each column. It also generates a statistical summary