dependence of behaviour of one variable based on
another’s behaviour). After the correlations are
computed, a hypothesis test is applied to retain
only the significant ones. In addition to
significance filtering, a correlation threshold is
typically applied to reduce network size and
remove non-meaningful correlations (such
as those near 0.00).
There are two main ways to filter a network:
hard thresholding or soft thresholding. Hard
thresholding removes edges based on a firm cut-off
value; typically the retained correlations fall in the
ranges -1.00 ≤ ρ ≤ -0.70 and 0.70 ≤ ρ ≤ 1.00. This
threshold is typically chosen because it captures only
relationships that are descriptive of the behaviour of
two genes. For example, a correlation of 0.70 has a
coefficient of determination (R², which is equivalent
to ρ²) of 49%, meaning that if the correlation reflects
a true relationship, 49% of the variance in one gene’s
expression can be explained by the other gene, and
vice versa.
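As an illustration of the hard-thresholding step, the following is a minimal sketch, not the authors' implementation. The gene names, expression values, and the helper names `pearson` and `hard_threshold` are hypothetical; the 0.70 cut-off on |ρ| follows the ranges quoted above.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def hard_threshold(expr, cutoff=0.70):
    """Return edges (gene_a, gene_b, rho) whose |rho| meets the cut-off."""
    genes = sorted(expr)
    edges = []
    for i, a in enumerate(genes):
        for b in genes[i + 1:]:
            rho = pearson(expr[a], expr[b])
            if abs(rho) >= cutoff:
                edges.append((a, b, rho))
    return edges

# Hypothetical expression values: four genes measured in five samples.
expr = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [2.0, 4.0, 6.0, 8.0, 10.0],  # rises with geneA
    "geneC": [5.0, 4.0, 3.0, 2.0, 1.0],   # falls as geneA rises
    "geneD": [2.0, -1.0, 0.0, -1.0, 2.0], # no linear relation to geneA
}
edges = hard_threshold(expr)

# Coefficient of determination at the cut-off itself: 0.70 squared.
r_squared_at_cutoff = 0.70 ** 2
```

In this toy input only the pairs among geneA, geneB and geneC survive the filter; geneD is dropped because its correlation with the others is essentially zero.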
Soft thresholding, popularized by Horvath and
Dong (2008) in the method known as WGCNA,
involves identifying the threshold at which the
network exhibits the scale-free properties that
some networks are expected to have, and
extracting the subnetwork of the original network
such that the filtered network is scale-free. Thus,
comparing two sets of expression data from the
same model and cell line but under different
environmental conditions might involve using
different correlation thresholds under the soft
thresholding approach.
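One common way to score how closely a network follows a scale-free (power-law) degree distribution is the R² of a straight-line fit to log p(k) against log k. The sketch below shows that score only; it is an assumption about how the criterion could be checked, not the WGCNA implementation, and the function name `scale_free_fit` and the degree list are hypothetical.

```python
import math
from collections import Counter

def scale_free_fit(degrees):
    """Fit log10 p(k) = a + b * log10 k and return (slope b, R^2).

    Nodes with degree 0 are ignored, since log10(0) is undefined.
    Needs at least two distinct degrees with different frequencies.
    """
    counts = Counter(k for k in degrees if k > 0)
    total = sum(counts.values())
    xs = [math.log10(k) for k in counts]
    ys = [math.log10(c / total) for c in counts.values()]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    r2 = (sxy * sxy) / (sxx * syy)
    return slope, r2

# Hypothetical node degrees whose frequencies are proportional to k^-2.
degrees = [1] * 36 + [2] * 9 + [3] * 4 + [6] * 1
slope, r2 = scale_free_fit(degrees)
```

Under this assumption, a soft-thresholding search would compute the score for the network obtained at each candidate threshold and keep the threshold at which R² is acceptably high. What counts as acceptably high is a choice the text above does not specify.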
While many studies have used variants of the
correlation network model with success, few studies
in network systems biology have examined the
robustness of correlations and how it might affect
network structure. For example, if a sample is
removed from the dataset, does the resulting
correlation keep the same value or does it change
significantly? A correlation that originally fell
within the proposed threshold but fails to do so
after sample removal might not represent a true
relationship in the data. This raises the question:
How many samples
are sufficient to assume a robust network? These and
other questions, if answered, can lead to insights
about how to remove noise from a correlation
network, and which relationships can be trusted,
without having to integrate extraneous biological
information. The novelty of this work lies in
addressing the current lack of understanding of the
stability or, by contrast, vulnerability of the
correlation network model.
While correlation does not imply a causative
relationship, the measure is still able to capture those
relationships that are causative; in capturing
everything, the measure is prone to noise. This
research investigates the possibility of using the
strength of correlation to remove some of that noise,
which can also serve as evidence to motivate
data-driven experimental studies.
Bioinformatics deals largely with publicly available
data; however, the results of the research here
suggest that the requirements of those studies can be
improved (i.e. by increasing sample number) for use
in systems biology.
2 METHODS
Briefly, this work presents a cursory review of the
effect that single-sample removal has on the Pearson
correlation coefficient in a hard-thresholded setting.
To investigate, networks were created and thresholded,
and then samples were iteratively removed to
determine the effect on correlation values.
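The single-sample removal procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the helper names `pearson` and `leave_one_out` and the expression values are hypothetical, and an edge is treated as robust here only if it stays at or above the 0.70 cut-off after every single-sample removal.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def leave_one_out(x, y, cutoff=0.70):
    """Recompute rho with each sample removed in turn.

    Returns (rho on all samples, list of rho with sample i removed,
    True if every value stays at or above the cut-off).
    """
    full = pearson(x, y)
    reduced = [
        pearson(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        for i in range(len(x))
    ]
    robust = full >= cutoff and all(r >= cutoff for r in reduced)
    return full, reduced, robust

# Hypothetical pair whose correlation is driven by the last sample alone.
fragile_x = [1.0, 2.0, 3.0, 4.0, 10.0]
fragile_y = [2.0, 1.0, 2.0, 1.0, 10.0]
full_f, reduced_f, robust_f = leave_one_out(fragile_x, fragile_y)

# Hypothetical pair that is close to linear across all samples.
steady_x = [1.0, 2.0, 3.0, 4.0, 5.0]
steady_y = [1.1, 1.9, 3.2, 3.9, 5.1]
full_s, reduced_s, robust_s = leave_one_out(steady_x, steady_y)
```

The first pair passes the threshold on the full data but turns negative once the outlying sample is dropped, which is the kind of edge the robustness check is meant to expose; the second pair remains above the cut-off whichever sample is removed.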
2.1 Network Creation
Three datasets were chosen to highlight the
difference in sample number; all datasets had 9 or
fewer samples, reflecting the current state of high-
throughput technology where most expression
experiments contain samples, at minimum, in
triplicate. The datasets chosen were:
GSE5078 (Verbitsky et al., 2004) – Mus
musculus hippocampus mRNA, compared at 2
months and 15 months (Young and Middle-
Aged, respectively). The Young dataset contains 9
samples and the Middle-Aged dataset contains 9
samples.
GSE5140 (Bender et al., 2008) – Mus musculus
whole brain mRNA, compared at untreated and
creatine-treatment (Untreated and Creatine,
respectively). The Untreated dataset contains 6
samples, and the Creatine dataset contains 6
samples.
GSE46384 (Ikushima and Misaizu) –
Saccharomyces cerevisiae untreated or exposed
to 40 g/l of isopropanol (0IPA and 40IPA,
respectively). The 0IPA dataset contains 4
samples, and the 40IPA dataset contains 4
samples.
A threshold of 0.70 ≤ ρ ≤ 1.00 on the Pearson
correlation coefficient was used to find correlated
expression relationships, and p-values were
computed using Student’s t-test with a
significance threshold of p < 0.0005. Network
sizes for each dataset are given below in Table 1. The
GSE5140 networks were the largest by node count.
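The text does not give the form of the t-test used. As an assumption, the standard test of a Pearson correlation against the null hypothesis of zero correlation, for n samples, uses the statistic:

```latex
t = \rho \sqrt{\frac{n - 2}{1 - \rho^{2}}}, \qquad \text{df} = n - 2
```

Under this form, the 4-sample datasets leave only 2 degrees of freedom, so a given ρ yields a much smaller t than it would in the 9-sample datasets, and a far higher correlation is needed to reach the same significance threshold.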