the amount of cryptocurrency traded, and VWAP is
volume-weighted average price. The target is the return on investment over a 15-minute horizon.
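For reference (this definition is standard rather than stated explicitly here), the VWAP over a window of n trades is

\[ \mathrm{VWAP} = \frac{\sum_{i=1}^{n} p_i \, v_i}{\sum_{i=1}^{n} v_i}, \]

where \(p_i\) and \(v_i\) denote the price and volume of the i-th trade.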
3.1 Data Preliminary Analysis
After the data is acquired, it is preprocessed so that the dataset fits the algorithms. This section describes the two sequential preprocessing steps applied to the data.
3.1.1 Dataset Framing
In our case, the cryptocurrency data amounts to up to 24 million records. Data of this size would require extensive processing time and CPU power, so the number of records is reduced. It should be noted that the dataset cannot be reduced carelessly without sacrificing important trends; dataset framing is therefore only possible once the dataset is known and has been explored correctly.
As discussed in Section 4, the crypto assets are non-stationary time series whose current values are being forecast. This gives us the liberty and flexibility to reduce the data: since the series is a non-stationary, randomized time series, if i denotes the instant at which a particular asset's value is observed, the value at the next instant i+1 is effectively a disjoint event. It should be noted, however, that this disjointness does not influence the overall forecast, which is performed on a relatively large and sufficient dataset.
For that matter, the 24 million records are reduced to 598k by truncating the dataset at November 2020. Fig. 2 shows the data frame of the reduced training dataset with the crypto asset features used for modeling.
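A minimal sketch of this framing step in pandas, assuming the raw data sits in a CSV with a Unix-seconds timestamp column (the file and column names, and the choice to keep the slice from November 2020 onward, are our assumptions):

import pandas as pd

# Sketch only: load the raw trade records (file name is hypothetical).
df = pd.read_csv("crypto_trades.csv")

# One reading of "truncating at November 2020": keep the most recent
# slice, from 1 November 2020 onward, shrinking ~24M rows to ~598k.
cutoff = pd.Timestamp("2020-11-01").timestamp()
framed = df[df["timestamp"] >= cutoff].reset_index(drop=True)
print(f"{len(framed)} records retained")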
3.1.2 Enriching Dataset
Data preliminary analysis is the pre-processing of data that ensures the data is clean and will not hinder the model to be built. One of the most important steps is to ensure that the data is completely populated, i.e., that it contains no NaN, NA, or null values. Such values produce a sparse vector and run the risk of inadequate training and a biased model; they can also lead to less precise or inaccurate modeling. Moreover, many ML algorithms do not support missing values at all. The best way to avoid this is to fill in or impute the missing values, or to delete the rows with missing values if the remaining records are sufficient (Tamboli, 2021). Deletion can be done per row or per column of the dataset, whereas filling can be done in many ways, for example by replacing the missing locations with an arbitrary value, the mean, mode, or median, or by forward and backward filling. For categorical variables, the most frequent value can be filled in, or the missing values can be given their own category (Tamboli, 2021). To check for missing values, any value that is not finite is flagged. We found that around 1.47%, or 8814, of the target values are missing. To tackle this, the affected rows are dropped, as the 598769 records are more than enough for training our algorithms. Fig. 3 shows the resultant dataset, now with 589955 records, and a general statistical description of the feature content for each crypto asset.
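A minimal sketch of this check-and-drop step, continuing from the framed DataFrame above (the Target column name is an assumption):

import numpy as np

# Flag any value that is not finite: treat inf/-inf as missing too.
framed = framed.replace([np.inf, -np.inf], np.nan)

missing = framed["Target"].isna().sum()
print(f"missing targets: {missing} ({missing / len(framed):.2%})")

# With ~598k records available, dropping the ~1.47% of rows without a
# target is preferable to imputing them.
framed = framed.dropna(subset=["Target"]).reset_index(drop=True)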
3.2 Feature Engineering
In this step of the forecast implementation, the features that influence the forecast are dealt with. In feature engineering, features are normally extracted and a transformed data frame is built from them. Within the scope of this project, feature extraction is omitted; instead, the existing features are transformed to increase the accuracy of the algorithms, since raw features can hinder the algorithms' ability to learn during modeling. This step helps the model refine itself to improve performance, and the transformed features become the important factors that affect the business problem.
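As an illustrative example only (the exact transformations are not listed here), a heavy-tailed trade feature such as traded volume could be log-scaled so that outliers do not dominate training:

import numpy as np

# Hypothetical transform: compress the heavy right tail of traded volume.
# `Volume` is an assumed column name from the trade features.
framed["log_volume"] = np.log1p(framed["Volume"])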
3.3 One Hot Encoding
In the dataset, there are two categorical variables that need to be aligned with the numeric trade features. In Fig. 3, these are Asset ID, which is unique for each of the fourteen crypto assets, and timestamp, which is the minute at which the trade started. The categorical variables are converted to numeric form using one-hot encoding (Fig. 4), a method for better expressing categorical variables. First, the timestamp feature is split into the weekday and hour associated with each record.
For every instance of each categorical variable, an additional column is added to the dataset: fourteen Asset ID columns (AssetID0: AssetID13), twenty-four hour columns (hour0: hour23), and seven weekday columns (weekday0: weekday6). For each row, only the column to which the asset's value belongs is set to 1, and all others are set to 0. This populates the dataset with precise indicator values across the whole range of categorical variables, which are on only at the given AssetID, hour, and weekday.
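A minimal sketch of this encoding step with pandas, assuming the Unix-seconds timestamp and integer Asset_ID columns used earlier (names are illustrative):

import pandas as pd

# Derive the weekday (0-6) and hour (0-23) categories from the timestamp.
ts = pd.to_datetime(framed["timestamp"], unit="s")
framed["weekday"] = ts.dt.weekday
framed["hour"] = ts.dt.hour

# One indicator column per category value: AssetID0..AssetID13,
# hour0..hour23, weekday0..weekday6; exactly one per group is 1 per row.
encoded = pd.get_dummies(
    framed,
    columns=["Asset_ID", "hour", "weekday"],
    prefix=["AssetID", "hour", "weekday"],
    prefix_sep="",
)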