expressed genes that commonly arises in the search
for disease biomarkers. GEDP answers are
presented, in turn, as an M × N array B = [B(k, j)],
where B(k, j) = 1 if the k-th gene is expressed at
condition j and B(k, j) = 0 otherwise. Because of the
currently limited knowledge of the cell's inner
mechanisms and the stochastic nature of the events
that lead to gene expressions, matrix B is more a
hypothesis on the states and transitions of the gene
expressions than a deterministic fact. Nonetheless,
these hypotheses are often formulated with the help
of deterministic data analysis algorithms that mine
each expression profile in G for signals of an
expression threshold t. Once a threshold t is
determined for the expression profile of gene k, the
k-th row of B is produced by assigning 1 to the j-th
entry if G(k, j) > t, and 0 otherwise.
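The thresholding rule above can be sketched in a few lines. This is only a minimal illustration in plain Python; the function names binarize_profile and binarize_matrix and the sample values are hypothetical and not part of the original implementation.

```python
def binarize_profile(profile, t):
    """Return one row of B: 1 where G(k, j) > t, and 0 otherwise."""
    return [1 if value > t else 0 for value in profile]


def binarize_matrix(G, thresholds):
    """Apply one threshold per gene (row) of the M x N array G."""
    if len(G) != len(thresholds):
        raise ValueError("need exactly one threshold per gene (row)")
    return [binarize_profile(row, t) for row, t in zip(G, thresholds)]


# Hypothetical 2 x 4 expression array and per-gene thresholds.
G = [[0.2, 0.9, 1.4, 0.1],
     [5.0, 5.2, 9.8, 9.9]]
B = binarize_matrix(G, [0.5, 7.0])
```

Note that the comparison is strict, so a value equal to the threshold is assigned 0.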
Several algorithms, based on different data
mining methodologies and on different conjectures
about the features in the data that signal an expression
threshold, have been designed. Their results are
often significantly different (Seguel et al., 2013).
In this article, I propose a wisdom-of-crowds
methodology for aggregating these algorithmic
decisions. The methodology is based on a
mathematical structure that I call multi-algorithm
aggregation scheme (MAS). MAS is inspired by the
logic underlying collective decision-making by
voting. MAS is a true alternative to average, median,
and other common aggregation formulas, as it
provides flexibility to select the voting method and a
decision-making rule, referred to as a doctrine. This
flexibility turns the method into an analytical tool
capable of testing the data with different
decision-making parameters. As a mathematical structure,
MAS can be used in applications other than gene
expression decisions.
The rest of this article is organized as follows:
Section 2 is a brief description of the algorithms
selected for the proposed multi-algorithmic scheme,
together with some basic time and space complexity
analysis. Section 3 is a mathematical description of
MAS and its implementation for solving the gene
expression decision problem. Section 4 reports the
results of experiments and comparisons between
MAS and other aggregation rules, and Section 5
summarizes some conclusions of this work.
2 SOME GENE EXPRESSION
DECISION ALGORITHMS
The gene expression decision algorithms that are the
basis of the proposed multi-algorithm method can be
classified into three main groups. The first group,
referred to as jump-based methods, consists of four
algorithms that determine the threshold on the basis
of a jump in the values of the gene expression
profile. Methods in this group are labelled J1, J2, J3
and J4. The second group consists of three
algorithms that determine a threshold on the basis of
approximations to the gene expression profile by
one-step functions. These algorithms are denoted S1,
S2 and S3 and are called one-step methods. The
threshold returned by a one-step method is the
midpoint between the two step values of the one-step
approximation whose step values are farthest apart.
The third group consists of two data clustering
methods, both based on Lloyd’s algorithm. These
methods are labelled C1 and C2. High-level
descriptions of each of these methods follow.
2.1 Jump-based Methods
Algorithm J1 sorts the input expression profile in
increasing order, and sets as threshold the midpoint
between the smallest and the highest jump in the
data. Algorithm J2 was introduced in (Shmulevich et
al., 2002). The method sorts the expression profile in
increasing order and computes the average of all
data jumps. Then, it sets as threshold the first value
that exceeds the average. Algorithm J3 is a variant
of Algorithm J2 that replaces the first value that
exceeds the average data jump with the mean of all
the values that exceed the average of the data jumps.
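The following is a hedged sketch of J2 and J3, not the reference implementation. It rests on one interpretive assumption that the text does not settle: the "value" associated with a jump is taken to be the sorted data value at the lower end of that jump, so that the strict rule G(k, j) > t marks everything above the jump as expressed. The function names are hypothetical.

```python
def sorted_jumps(profile):
    """Sort the profile and return it with its consecutive differences."""
    x = sorted(profile)
    jumps = [b - a for a, b in zip(x, x[1:])]
    if not jumps:
        raise ValueError("profile needs at least two values")
    return x, jumps


def threshold_j2(profile):
    """J2 sketch: value at the first jump that exceeds the average jump.

    ASSUMPTION: the threshold is the data value just below that jump.
    """
    x, jumps = sorted_jumps(profile)
    avg = sum(jumps) / len(jumps)
    for i, d in enumerate(jumps):
        if d > avg:
            return x[i]
    raise ValueError("no jump exceeds the average (all jumps are equal)")


def threshold_j3(profile):
    """J3 sketch: mean of the values at all jumps exceeding the average.

    ASSUMPTION: same convention as threshold_j2 for the value of a jump.
    """
    x, jumps = sorted_jumps(profile)
    avg = sum(jumps) / len(jumps)
    selected = [x[i] for i, d in enumerate(jumps) if d > avg]
    if not selected:
        raise ValueError("no jump exceeds the average (all jumps are equal)")
    return sum(selected) / len(selected)


# Hypothetical profile: sorted values 1, 2, 3, 10, 11, 20 with jumps
# 1, 1, 7, 1, 9 and average jump 3.8; jumps 7 and 9 exceed the average.
profile = [10.0, 1.0, 3.0, 2.0, 20.0, 11.0]
t2 = threshold_j2(profile)
t3 = threshold_j3(profile)
```

On this profile the two variants disagree: J2 stops at the first large jump, while J3 averages over both large jumps and so returns a higher threshold.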
The main advantages of algorithms J1, J2 and J3
are conceptual and computational simplicity. In fact,
they all return the M thresholds of an M × N array G
in O(MN) time, using O(N) space. Algorithm J4 is
more complex. This method is an implementation of
the Binarization Across Multiple Scales (BASC)
algorithm (Hopfensitz et al., 2011). BASC
approximates the input expression profile sorted in
increasing order with a sequence of step functions,
each with a different number of steps. It starts with
the step function that fits exactly the input data.
Then, it produces a sequence of step functions, each
with one less step than the previous one. Dynamic
programming is used to ensure that each new step
function minimizes the Euclidean distance to the
sorted expression profile. For each step function in
the sequence, the ratio between the highest step
jump and the Euclidean distance of the step function
to the input data is computed. A high ratio is
declared to be a strong discontinuity and its index is
saved in a vector v. Then, the method computes the
median m of the indices in v and defines the
threshold as the average of the data points indexed by
m and m + 1.
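Only the final step of J4 is simple enough to sketch briefly. The snippet below assumes the vector v of strong-discontinuity indices has already been computed by the dynamic programming stage, which is not shown. It also assumes 0-based indices and uses the lower median so that m is always an index that actually appears in v; both choices are assumptions of this sketch rather than details given above, and the function name is hypothetical.

```python
import statistics


def threshold_from_discontinuities(sorted_profile, v):
    """Average the data points at indices m and m + 1, with m the median of v.

    ASSUMPTIONS: indices in v are 0-based positions in sorted_profile,
    and the lower median is used when v has an even number of entries.
    """
    if not v:
        raise ValueError("v must contain at least one index")
    m = statistics.median_low(v)
    if not 0 <= m < len(sorted_profile) - 1:
        raise ValueError("median index must have a right-hand neighbour")
    return (sorted_profile[m] + sorted_profile[m + 1]) / 2


# Hypothetical sorted profile and discontinuity indices.
sorted_profile = [1.0, 2.0, 3.0, 10.0, 11.0]
v = [2, 2, 1]
t4 = threshold_from_discontinuities(sorted_profile, v)
```

Here the median index is 2, so the threshold falls halfway between 3.0 and 10.0.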
BIOINFORMATICS 2015 - International Conference on Bioinformatics Models, Methods and Algorithms