maximum number of occurrences (degree) of an element (mass) in
different sets (proteins), the problem is fixed-parameter tractable
(FPT); for details, see (Damaschke and Molokov, 2009). Masses with
high degrees appear to carry little information. In the extreme case
where the degree of a particular mass equals the number of all
possible proteins (an omnipresent mass), the mass can be safely
removed from the problem formulation, since any protein mixture
explains it. Otherwise, there is almost surely another mass that
occurs in a subset of the proteins in which the high-degree mass
occurs. We conclude that one can drop high-degree masses from
consideration (and thus bound the maximum degree of an element).
The solutions obtained can be verified afterwards to check that they
explain these masses.
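The preprocessing idea above can be illustrated with a minimal sketch. It assumes the database is a dictionary mapping protein names to sets of masses; the function names, the `max_degree` threshold and the toy data are hypothetical and serve only as an illustration.

```python
from collections import Counter


def mass_degrees(db):
    """Count, for every mass, how many protein spectra contain it."""
    degrees = Counter()
    for spectrum in db.values():
        degrees.update(set(spectrum))
    return degrees


def drop_high_degree_masses(observed, db, max_degree):
    """Remove masses whose degree exceeds max_degree (an assumed threshold)
    from both the observed spectrum and every database spectrum."""
    degrees = mass_degrees(db)
    dropped = {m for m, d in degrees.items() if d > max_degree}
    reduced_db = {p: set(s) - dropped for p, s in db.items()}
    return set(observed) - dropped, reduced_db, dropped


# Toy data: mass 100 occurs in every protein (an omnipresent mass).
db = {
    "p1": {100, 200, 300},
    "p2": {100, 400},
    "p3": {100, 200, 500},
}
observed = {100, 200, 300, 400}
reduced_observed, reduced_db, dropped = drop_high_degree_masses(
    observed, db, max_degree=2
)
```

The dropped masses are returned as well, so that candidate solutions can later be checked against them.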
The aforementioned computational problem, surprisingly, reappears in
work on protein inference in shotgun proteomics (Nesvizhskii and
Aebersold, 2005; Aebersold and Mann, 2003), where peptide sequences
are used instead of mass spectra. Again, the goal is the
identification of proteins, and the problem was recognized as
NP-hard in (Aebersold and Mann, 2003).
In both cases, whether the proteins are identified by observed
masses or by peptide sequences, the presence of errors must be taken
into account. Some observations (masses, sequences) may be lost or
altered in such a way that the equipment cannot detect them, and
some are noise: spurious peaks or peptide fragments introduced by
flaws in the experiment. A significant amount of noise can be
removed as described in (Samuelsson et al., 2004), but these
procedures do not guarantee error-free output. One way to address
this is to
reformulate the problem as SET COVER WITH MISSING ELEMENTS: given a
spectrum $M$, a database $M_p$, $p \in P_{db}$, of theoretical mass
spectra and numbers $k$, $c$, decide whether there is a set of
proteins $P_s$ such that
$$\mathrm{card}(P_s) \le c, \qquad \Bigl| M \,\triangle \bigcup_{p \in P_s} M_p \Bigr| \le k.$$
We refer to this problem as SCME. If $k = 0$, this is the
classical SET COVER problem.
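To make the SCME condition concrete, the following minimal sketch checks whether a candidate protein set has at most $c$ members and whether the symmetric difference between the observed spectrum and the union of the chosen theoretical spectra has at most $k$ elements. The exhaustive enumeration is only a reference implementation for tiny inputs, not a method proposed in this paper; all names and the toy data are hypothetical.

```python
from itertools import combinations


def explained_masses(proteins, db):
    """Union of the theoretical spectra of the given proteins."""
    result = set()
    for p in proteins:
        result |= db[p]
    return result


def is_scme_solution(observed, proteins, db, k, c):
    """True if at most c proteins are used and the symmetric difference
    between observed and explained masses has at most k elements."""
    if len(proteins) > c:
        return False
    mismatch = observed ^ explained_masses(proteins, db)
    return len(mismatch) <= k


def enumerate_scme(observed, db, k, c):
    """Exhaustively list all solutions (exponential; tiny inputs only)."""
    names = sorted(db)
    solutions = []
    for size in range(c + 1):
        for subset in combinations(names, size):
            if is_scme_solution(observed, subset, db, k, c):
                solutions.append(frozenset(subset))
    return solutions


db = {"p1": {100, 200}, "p2": {300}, "p3": {200, 300, 400}}
observed = {100, 200, 300}
exact = enumerate_scme(observed, db, k=0, c=2)      # no mismatches allowed
tolerant = enumerate_scme(observed, db, k=1, c=2)   # one mismatch allowed
```

On this toy input, allowing a single mismatch already increases the number of solutions from one to three, which hints at how quickly the solution space grows with $k$.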
Unfortunately, if $k \ge 10$, the resulting space of possible
solutions contains so many sets and elements that it is infeasible
to find solutions in reasonable time. This may be why scoring
candidate sets is preferred to exact enumeration of solutions, both
in classical single-protein schemes and in attempts to cope with
more complex mixtures (He et al., 2009; Jensen et al., 1997). The
implicit assumption is that there exists a set of proteins (or a
single protein) that explains the data best; hence a scoring
function is introduced, and a set of proteins on which the function
attains its optimal value is sought (with the expected drawback that
only a local optimum can be guaranteed). We considered a different
possibility: there might be several such sets, and it is of interest
to observe all possible solutions in order to see, e.g., which
proteins are present in all or most solutions (omnipresent
proteins), which seem to appear interchangeably, and which are
observed in only a few solutions. Proteins of the first type are the
most important ones, because if one assumes that the initial set of
proteins is among the listed solutions, then the proteins present in
all solutions are contained in the initial set as well.
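Given an enumerated list of solutions, the classification of proteins described above can be sketched as follows. The representation of a solution as a set of protein names, the function names and the example list are assumptions made for illustration.

```python
from collections import Counter


def protein_frequencies(solutions):
    """Number of solutions in which each protein appears."""
    counts = Counter()
    for solution in solutions:
        counts.update(solution)
    return counts


def omnipresent_proteins(solutions):
    """Proteins contained in every solution (empty set if no solutions)."""
    if not solutions:
        return set()
    return set.intersection(*(set(s) for s in solutions))


# Hypothetical list of enumerated solutions.
solutions = [
    frozenset({"p1"}),
    frozenset({"p1", "p2"}),
    frozenset({"p1", "p3"}),
]
always = omnipresent_proteins(solutions)
counts = protein_frequencies(solutions)
rare = {p for p, n in counts.items() if n == 1}
```

Here "p1" appears in every solution, so under the assumption that the initial protein set is among the listed solutions, it must belong to the initial set.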
One way to reduce the complexity of SCME is to use extra information
from the data. We already know that, without errors, the mass
spectrum should be precisely the union of the theoretical spectra of
some set of proteins. This knowledge gives rise to a different
approach: first remove the errors, then solve the remaining problem,
which simulations show to be simple. To remove errors, we want to
solve another computational problem, namely UNION EDITING, which we
formulate as follows: given a set $M$, a collection of its subsets
$M_p$, $p \in P_{db}$, and integers $k$ and $l$, find some complete
union $M'$ which is obtained from $M$ by at most $k$ insertions and
$l$ deletions, that is,
$$|M' \setminus M| \le k, \qquad |M \setminus M'| \le l.$$
(The notion of a complete union will be defined in the
"Formalization" section for the sake of consistency.) Again, in the
enumeration version we want all such sets $M'$. It appears that a
simple branching algorithm for solving UNION EDITING, which we
present later, works successfully and sufficiently fast.
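As an illustration of the enumeration version of UNION EDITING, the following naive sketch decides for each database spectrum whether to include it and prunes a branch as soon as the insertion budget $k$ is exceeded. It is not the branching algorithm presented later in this paper. It also assumes, ahead of the formal definition, that a complete union is a union of some of the database spectra $M_p$; the function name and toy data are hypothetical.

```python
def enumerate_union_edits(observed, db, k, l):
    """Enumerate unions M' of database spectra (assumed meaning of
    'complete union') with |M' - M| <= k and |M - M'| <= l."""
    observed = frozenset(observed)
    names = sorted(db)
    found = set()

    def branch(index, union):
        # Insertions can only grow when more spectra are added,
        # so a branch over budget can be abandoned immediately.
        if len(union - observed) > k:
            return
        if index == len(names):
            # Deletions are the observed masses left unexplained.
            if len(observed - union) <= l:
                found.add(union)
            return
        branch(index + 1, union)                                # exclude spectrum
        branch(index + 1, union | frozenset(db[names[index]]))  # include spectrum

    branch(0, frozenset())
    return found


db = {"p1": {100, 200}, "p2": {300}, "p3": {200, 300, 400}}
observed = {100, 200, 300, 999}  # 999 plays the role of a noise peak
strict = enumerate_union_edits(observed, db, k=0, l=1)
relaxed = enumerate_union_edits(observed, db, k=1, l=1)
```

Different protein sets may yield the same union, so the results are collected in a set of frozensets to avoid duplicates.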
Therefore, the approach of enumerating the possible satisfying
protein sets seems promising to us. We present an algorithm solving
the problem, with simulation results reserved for the last section
of this article.
2 FORMALIZATION
2.1 Preprocessing
It is time to go into more detail about how one would formulate the
biological problem in mathematical terms. We will refer to the
protein set (mixture) $P = \{p_1, p_2, \ldots, p_k\}$ as the initial
protein set. As a result of mass spectroscopy, $P$ generates a set
of masses $M = \{m_1, m_2, \ldots, m_c\}$ (the theoretical spectrum
of $P$), which is changed by experimental errors into $M'$ (the
empirical spectrum of $P$). We want to determine what $P$ was.
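The relation between the initial protein set, its theoretical spectrum and the empirical spectrum can be mimicked by a small sketch. The error model used here (an explicit set of lost masses and an explicit set of noise masses) is an assumption chosen for illustration, as are all names and values.

```python
def theoretical_spectrum(proteins, db):
    """Set of masses M generated by the protein set P."""
    masses = set()
    for p in proteins:
        masses |= set(db[p])
    return masses


def apply_errors(theoretical, lost, noise):
    """Empirical spectrum M': lost masses are removed, noise masses added."""
    return (set(theoretical) - set(lost)) | set(noise)


db = {"p1": {100, 200}, "p2": {300}, "p3": {200, 300, 400}}
initial_proteins = {"p1", "p2"}                           # the unknown P
M = theoretical_spectrum(initial_proteins, db)
M_empirical = apply_errors(M, lost={200}, noise={999})
error_count = len(M ^ M_empirical)                        # one loss, one noise peak
```

The task is then to recover `initial_proteins` from `M_empirical` and `db` alone.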
KDIR 2010 - International Conference on Knowledge Discovery and Information Retrieval