APPLICATION OF COMBINATORIAL METHODS TO PROTEIN IDENTIFICATION IN PEPTIDE MASS FINGERPRINTING

Leonid Molokov

Department of Computer Science and Engineering, Chalmers University of Technology, Gothenburg, Sweden

Keywords:

Bioinformatics, Proteomics, Peptide mass fingerprinting, Error removal, Union editing, Set cover, Hitting set, Simulations.

Abstract:

Peptide mass fingerprinting (PMF) has long been a widely used and reliable method for protein identification. However, it faces several problems, the most important of which is the inability of classical methods to deal with protein mixtures. To cope with this problem, more costly experimental techniques are employed. We investigate whether it is possible to extract more information from PMF through more thorough data analysis. To do this, we propose a novel method for removing noise from the data and show how the results can be interpreted in a different way. We also provide simulation results suggesting that our method can be used for the analysis of small mixtures.

1 INTRODUCTION

Proteomics is an important and fast-developing field in bioinformatics, concerned with protein identification and quantification. As proteins are complex molecules, often one can only measure the properties of their by-products or components. Proteomics studies how to produce smaller molecules, which of their properties to investigate, and how to recover information about the initial proteins from these known properties. One of the most extensively used techniques in the field is peptide mass fingerprinting (PMF). The key idea of the method is to break proteins into pieces, measure the masses of the fragments and use them to identify the initial proteins. Protein fragmentation is believed to be reproducible across repeated experiments, which makes analysis of the observed set of masses possible. Although the method was widely used in past decades, it is now losing ground to shotgun proteomics because of two major problems:

1. Mass ambiguity. There are peptides with nearly identical masses, the difference being less than the sensitivity of the measurement instrument.

2. Single protein analysis. Many widely used analysis methods, such as Mascot (Perkins et al., 1999), MS-Fit (Clauser et al., 1999) and ProFound (Zhang and Chait, 2000), treat the sample as if it were a single protein. Analysis of more complex mixtures is deemed too difficult for reasons that we discuss below.

Mass ambiguity leads to lower specificity in protein identification and is a problem of the experimental setup, which we can hardly solve. In contrast, the single-protein limitation is imposed by the method that interprets the data. This was already recognized in earlier works (He et al., 2009; Jensen et al., 1997), which attempted to provide solutions for protein mixtures.

But why was this limitation imposed? Consider a set of masses $M = \{m_1, m_2, \dots, m_{|M|}\}$ observed in the experiment. Let us assume that the experiment was conducted under perfect conditions and that each particular mass $m$ corresponds to some protein $p$ whose mass spectrum $M_p \subseteq M$ is fully represented in the experimental spectrum. It is also reasonable to limit the cardinality of the initial set of proteins $P = \{p_1, p_2, \dots, p_n\}$ that produced the spectrum $M$: we assume $n \le c$ (where $c$ is some integer constant). We also have a collection of theoretical spectra $\{M_p, p \in P_{\mathrm{all}}\}$, one for each protein $p$ in the database (here $P_{\mathrm{all}}$ denotes the set of all database proteins). One can see that the question whether there exists a set of proteins $P \subseteq P_{\mathrm{all}}$ such that $\mathrm{card}(P) \le c$ and $M \subseteq \bigcup_{p \in P} M_p$, explaining the spectrum $M$, is precisely a SET COVER problem, which is known to be NP-complete. We

have an enumeration problem, i.e. to describe all the possible satisfying sets (if there are any), which is even harder. Nevertheless, simulations show that in practical cases most proteins have unique masses, which means quite optimistic real running times. We will cover that in the section "Simulations". Worst-case analysis is not hopeless either: if one fixes the maximum number of occurrences (degree) of an element (mass) in different sets (proteins), the problem is fixed-parameter tractable (FPT); for details, see (Damaschke and Molokov, 2009). It seems that masses with high degrees are not informative. In the extreme case where the degree of a particular mass is equal to the number of all possible proteins (an omnipresent mass), it can be safely removed from the problem formulation, as any protein mixture will explain it. Otherwise, there almost surely exists another mass that occurs in a subset of the proteins containing the high-degree mass. This leads us to the conclusion that one can drop high-degree masses from consideration (and thus bound the maximum degree of an element). The solutions obtained can be verified later to check that they explain these masses.

Molokov L.
APPLICATION OF COMBINATORIAL METHODS TO PROTEIN IDENTIFICATION IN PEPTIDE MASS FINGERPRINTING.
DOI: 10.5220/0003102703070313
In Proceedings of the International Conference on Knowledge Discovery and Information Retrieval (KDIR-2010), pages 307-313
ISBN: 978-989-8425-28-7
© 2010 SCITEPRESS (Science and Technology Publications, Lda.)

The aforementioned computational problem surprisingly reappears in works dedicated to protein inference in shotgun proteomics (Nesvizhskii and Aebersold, 2005; Aebersold and Mann, 2003), where peptide sequences are used instead of mass spectra. Again the goal is the identification of proteins, and the problem was recognized as NP-hard in (Aebersold and Mann, 2003).

One can see that in both cases, whether the proteins are identified by observed masses or by peptide sequences, the presence of errors should be taken into account. Indeed, some observations (masses, sequences) could be lost or altered in such a way that the equipment cannot detect them, and some represent noise peaks, i.e. peptide fragments that were inserted as a consequence of flaws in the experiment. A significant amount of noise can be removed as described in (Samuelsson et al., 2004), but these procedures do not guarantee error-free output. A way to address this is to reformulate the problem as SET COVER WITH MISSING ELEMENTS: given a spectrum $M$, a database $\{M_p, p \in P_{\mathrm{all}}\}$ of theoretical mass spectra and numbers $k$, $c$, find out whether there is a set of proteins $P \subseteq P_{\mathrm{all}}$ such that

$$\mathrm{card}(P) \le c, \qquad \Bigl| M \,\triangle \bigcup_{p \in P} M_p \Bigr| \le k.$$

Let us refer to it below as SCME. If $k = 0$, this is the classical SET COVER problem.
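To make the SCME condition concrete, here is a minimal brute-force sketch in Python. It is an illustration under stated assumptions, not the method used in this paper: the function name `scme_solutions`, the toy database and the use of integer masses are all hypothetical, and exhaustive enumeration over subsets is only viable for tiny inputs.

```python
from itertools import combinations

def scme_solutions(spectrum, database, c, k):
    """Return every protein set P with |P| <= c and |M symdiff union(M_p)| <= k."""
    spectrum = frozenset(spectrum)
    names = sorted(database)
    found = []
    for size in range(1, c + 1):
        for subset in combinations(names, size):
            # Union of the theoretical spectra of the candidate mixture.
            union = frozenset().union(*(database[p] for p in subset))
            # Symmetric difference counts both unexplained and missing masses.
            if len(spectrum ^ union) <= k:
                found.append(set(subset))
    return found

# Hypothetical toy database: protein name -> set of peptide masses.
toy_db = {"p1": {100, 200, 300}, "p2": {300, 400}, "p3": {500}}
observed = {100, 200, 400}
exact = scme_solutions(observed, toy_db, c=2, k=0)     # no errors tolerated
tolerant = scme_solutions(observed, toy_db, c=2, k=1)  # one error tolerated
```

In this toy instance no mixture reproduces the observed spectrum exactly, while allowing a single error admits the mixture of `p1` and `p2` (mass 300 is then treated as lost).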

Unfortunately, if k ≥ 10, the resulting space of possible solutions contains so many sets and elements that it is infeasible to find solutions in reasonable time. This might be the reason why scoring candidate sets is preferred to exact enumeration of solutions, both in classical single-protein schemes and in attempts to cope with more complex mixtures (He et al., 2009; Jensen et al., 1997). There is an implicit assumption that there exists a set of proteins (or a single protein) that explains the data best; hence, a scoring function is introduced, and a set of proteins on which the function reaches its optimal value is sought (unfortunately, this has the expected drawback that only a local optimum can be guaranteed). We considered a different possibility: there might be several such sets, and it might be interesting to observe all possible solutions to see, e.g., which proteins are present in all or most of the solutions (omnipresent proteins), which seem to appear as alternatives to one another, and which are observed in only a few solutions. The first type of proteins is the most important, because if one assumes that the initial set of proteins is among the listed solutions, then every protein present in all solutions is contained in the initial set of proteins as well.
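The way enumerated solutions can be summarized may be sketched as follows. This is a hedged illustration: the function name `classify_proteins` and the example solutions are hypothetical, and only proteins contained in every solution are treated as omnipresent here.

```python
from collections import Counter

def classify_proteins(solutions):
    """Split proteins by the fraction of enumerated solutions that contain them."""
    counts = Counter(p for sol in solutions for p in set(sol))
    total = len(solutions)
    # Proteins found in every solution must belong to the true mixture,
    # provided the true mixture is itself among the solutions.
    omnipresent = {p for p, n in counts.items() if n == total}
    # Remaining proteins, with the fraction of solutions they appear in.
    occasional = {p: n / total for p, n in counts.items() if n < total}
    return omnipresent, occasional

# Hypothetical list of enumerated solutions (protein sets).
solutions = [{"p1", "p2", "p4"}, {"p1", "p3", "p4"}, {"p1", "p2"}]
omni, occasional = classify_proteins(solutions)
```

Here `omni` contains only `p1`, while `p2`, `p3` and `p4` are reported with their frequencies so that frequent alternatives can be told apart from rare ones.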

One way to reduce the complexity of SCME is to use some extra information from the data. We already know that without errors, the mass spectrum should be precisely the union of the theoretical spectra of some set of proteins. This knowledge gives rise to a different approach: first remove the errors, then try to solve the now simple (according to what simulations show) problem. To remove errors, we want to solve another computational problem, namely UNION EDITING, which we formulate as follows: given a set $M$, a collection of sets $\{M_p, p \in P_{\mathrm{all}}\}$ and integers $k$ and $l$, find some complete union $M'$ that is obtained from $M$ by at most $k$ insertions and $l$ deletions, that is,

$$|M' \setminus M| \le k, \qquad |M \setminus M'| \le l.$$

(The notion of a complete union will be defined in the "Formalization" section for the sake of consistency.) Again, in the enumeration version we want all such sets $M'$. It appears that a simple branching algorithm for solving UNION EDITING, which we will show later, works successfully and sufficiently fast.

Therefore, the approach of enumerating the possible satisfying protein sets seems promising to us. We present an algorithm solving the problem, with simulation results saved for the last section of this article.

2 FORMALIZATION

2.1 Preprocessing

It is time to go into more detail about how one would formulate the biological problem in mathematical terms. We will refer to the protein set (mixture) $P = \{p_1, p_2, \dots, p_n\}$ as the initial protein set. As a result of mass spectrometry, $P$ generated a set of masses $M = \{m_1, m_2, \dots, m_{|M|}\}$ (the theoretical spectrum of $P$), which was changed by experimental errors into $M'$ (the empirical spectrum of $P$). We want to understand what $P$ was.


Let us say that $P$ (a set of proteins) explains $M$ (the theoretical spectrum of $P$) if $M = \bigcup_{p \in P} M_p$. If, however, there is a $P' \subset P$ that explains $M$ fully, i.e. produces the same spectrum, it is impossible to say, given the data, whether $P$ or $P'$ generated $M$. We will use Occam's razor in a similar way as in (Nesvizhskii and Aebersold, 2005). More precisely, we think it is reasonable to seek only minimal protein sets, that is, sets with the property that if one protein is removed from the set, it no longer explains $M$. It is also clear that one has to limit the size of candidate mixtures, because the experimental setup attempts to reduce the number of proteins under investigation.
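The minimality requirement can be checked mechanically. The sketch below is illustrative only; the names `explains` and `is_minimal_explanation` and the toy database are hypothetical. Testing single removals suffices: if any proper subset explains the spectrum, then so does every set lying between that subset and the full set.

```python
def explains(proteins, database, spectrum):
    """True if the union of the proteins' theoretical spectra equals the spectrum."""
    union = set()
    for p in proteins:
        union |= database[p]
    return union == set(spectrum)

def is_minimal_explanation(proteins, database, spectrum):
    """True if the set explains the spectrum and no single protein can be dropped."""
    proteins = set(proteins)
    if not explains(proteins, database, spectrum):
        return False
    return not any(explains(proteins - {p}, database, spectrum) for p in proteins)

# Hypothetical toy database: protein name -> set of peptide masses.
toy_db = {"p1": {1, 2}, "p2": {2, 3}, "p3": {1, 2, 3}}
spectrum = {1, 2, 3}
pair_ok = is_minimal_explanation({"p1", "p2"}, toy_db, spectrum)
triple_ok = is_minimal_explanation({"p1", "p2", "p3"}, toy_db, spectrum)
```

In the toy instance the pair `p1`, `p2` is a minimal explanation, whereas adding `p3` yields a set that still explains the spectrum but is no longer minimal.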

As mentioned before, in our model we assume that throughout the experiment the ideal spectrum $M$ was altered by $k$ insertions and $l$ deletions. Let us denote the set of inserted masses by $M_{\mathrm{ins}}$ and the set of deleted masses by $M_{\mathrm{del}}$ ($M_{\mathrm{del}} \subseteq M$). What we observe in reality is the empirical spectrum $M'$, from which we want to infer what the possible spectra $M$ could be. Some insertions can be removed almost immediately (for the first preprocessing steps, see (Samuelsson et al., 2004)); however, some will still remain. Unless we specify bounds for the possible numbers of errors (reflecting the quality of the experiment), it is impossible to say what $M$ was, because $M'$ could drift too far from its ancestor. That is, we have to make the additional assumption that $k$ and $l$ are limited by constants $k_{\max}$ and $l_{\max}$. As soon as we do so, however, we get an immediate consequence: none of the proteins in $P_{\mathrm{big}} = \{p \in P_{\mathrm{all}} \mid \mathrm{card}(M_p \setminus M') > l_{\max} \text{ or } \mathrm{card}(M' \setminus M_p) > k_{\max}\}$ were in $P$. This is a useful observation, because if a mass can only be produced by proteins in $P_{\mathrm{big}}$, it could not be in $M$.
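A hedged sketch of this preprocessing filter is given below. It applies only the deletion-bound test: a protein of the true mixture has its whole spectrum inside the true spectrum, so it cannot miss more masses in the observed spectrum than the allowed number of deletions. The names `drop_big_proteins` and `l_max` and the toy data are assumptions made for illustration.

```python
def drop_big_proteins(database, observed, l_max):
    """Keep proteins that miss at most l_max of their masses in the observed spectrum."""
    observed = set(observed)
    return {
        p: masses
        for p, masses in database.items()
        # Masses of p absent from the observation would all have to be deletions.
        if len(set(masses) - observed) <= l_max
    }

# Hypothetical toy database: protein name -> set of peptide masses.
toy_db = {"p1": {1, 2, 3}, "p2": {3, 4, 5, 6}}
kept = drop_big_proteins(toy_db, observed={1, 2, 4}, l_max=1)
```

With at most one deletion allowed, `p1` (one missing mass) survives the filter and `p2` (three missing masses) is discarded.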

In other words, we have a hypergraph $H$ whose vertices represent masses and whose edges represent proteins. A vertex representing a mass is incident to an edge representing a protein iff the protein has the mass in its spectrum. Essentially, $H = (V, E)$ with $V = M_{\mathrm{all}}$ and $E = P_{\mathrm{all}}$, where $M_{\mathrm{all}}$ represents all the masses and $P_{\mathrm{all}}$ all the proteins in the database. Given $M' \subseteq M_{\mathrm{all}}$, a set of observed masses, and $k_{\max}$ and $l_{\max}$ as integer bounds for the numbers of insertions and deletions respectively, find $M \subseteq M_{\mathrm{all}}$ such that $|M \setminus M'| \le l_{\max}$, $|M' \setminus M| \le k_{\max}$ and $M = \bigcup_{p \in P} M_p$, where $P$ is some subset of $P_{\mathrm{all}}$. In this error correction problem we do not limit the size of $P$ by the constant $c$, as this can be done at a later stage, where one solves the SET COVER problem for the recovered mass spectra.

As was said, the list of possible $M$ for $H$ and for $H' = (M_{\mathrm{all}}, P_{\mathrm{all}} \setminus P_{\mathrm{big}})$ is the same (this follows immediately from the problem formulation), which gives a substantial reduction in the number of edges. Can we do anything else? The answer is yes, but we have to make an additional assumption. Consider a mass (vertex) $m \in M_{\mathrm{all}}$. If $m$ is not adjacent to any mass (vertex) from $M'$ in the hypergraph $H'$, then it is one of thousands of other masses in the database about whose presence we have no available evidence. We assume that no such mass was initially in the experiment, because we cannot make any definite conclusion about these masses. This allows us to shrink the hypergraph to the edges adjacent to $M'$ only. Let us denote such a hypergraph by $H(M')$.
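The shrinking step can be sketched in a few lines, assuming the database is stored as a mapping from protein names to mass sets. The function name `restrict_to_observed` and the toy data are hypothetical.

```python
def restrict_to_observed(database, observed):
    """Keep only edges (proteins) sharing a mass with the observed spectrum."""
    observed = set(observed)
    edges = {p: set(m) for p, m in database.items() if set(m) & observed}
    # The vertex set shrinks to the masses of the surviving proteins.
    vertices = set().union(*edges.values()) if edges else set()
    return vertices, edges

# Hypothetical toy database: protein name -> set of peptide masses.
toy_db = {"p1": {1, 2, 3}, "p2": {3, 4}, "p3": {7, 8}}
vertices, edges = restrict_to_observed(toy_db, observed={1, 4})
```

Protein `p3` shares no mass with the observation, so it and its masses 7 and 8 drop out of the instance.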

The aforementioned assumption is strong and can lead to the exclusion of some proper solutions. How drastic is the difference from the full list of solutions? Imagine we indeed have $M_{\mathrm{iso}} \subseteq M$ such that $M_{\mathrm{iso}} \cap \bigcup_{e \in H(M')} e = \emptyset$, i.e. the edges of $H(M')$ and $H(M_{\mathrm{iso}})$ do not intersect (they form different connected components). Let us imagine we are solving the full problem and consider the hypergraph $H(M') \cup H(M_{\mathrm{iso}})$. After union editing, which can only add or remove vertices, these two components will still be disconnected, which means that the SET COVER instance will also contain two connected components in its hypergraph. It is obvious that every solution for this instance of SET COVER can be split into two parts, each of which explains one component. If the size of the solution part for $H(M_{\mathrm{iso}})$ is small and fixed, solutions for the other component will represent a part of some solution for the whole hypergraph. This suggests that, as a result of our limitation, we can only lose some proteins from the solutions, but we cannot get any extra unwanted proteins.

To summarize, we have made the following assumptions:

• The mixture size is limited by a constant, i.e. the Set Cover size is limited by the parameter $c$.

• The spectrum of the true mixture was altered by a limited number of errors: at most $k_{\max}$ insertions and $l_{\max}$ deletions.

• Proteins are observed at least partially, i.e. there are no entirely lost proteins (if there are, we simply ignore them). This does not affect what we learn about other proteins.

2.2 Algorithm for Union Editing Enumeration

In this subsection we give a simple branching algorithm that solves the UNION EDITING problem.

Let us call a vertex $v$ of some set $M'$ complete in the hypergraph $H = (V, E)$ (where not necessarily $M' \subseteq V$) if $\exists e \in E$ such that $v \in e$ and $e \subseteq M'$. Remember that we are interested in sets $M'$ that are a union of some subfamily of $\{M_p, p \in P_{\mathrm{all}}\}$; let us call such sets complete unions. It follows from the definition that $M'$ is a complete union iff every vertex $v \in M'$ is complete. That immediately gives an idea for a branching algorithm: to guarantee that a particular internal vertex $v$ is complete, we have to either remove $v$ from $M'$, or choose some edge $e$ with $v \in e$ and add all its external vertices (those that were not in $M'$). Whenever we remove a vertex, we assume it was added by mistake, and thus we count this occurrence as an insertion. In the same way, when we add a vertex, we consider it to have been deleted erroneously, and thus we count it as a deletion. Therefore, throughout the branching process, we maintain counters for the allowed insertions and deletions ($k$ and $l$ are given in the problem instance). The moment some counter becomes negative, the current branch is terminated.
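The notions of a complete vertex and a complete union translate directly into code. The following is a small sketch with hypothetical function names, representing edges as Python sets of masses.

```python
def incomplete_vertices(current, edges):
    """Vertices of `current` not contained in any edge lying entirely inside `current`."""
    current = set(current)
    covered = set()
    for e in edges:
        if set(e) <= current:   # the edge is fully contained in the candidate set
            covered |= set(e)
    return current - covered

def is_complete_union(current, edges):
    """A set is a complete union iff all of its vertices are complete."""
    return not incomplete_vertices(current, edges)

# Hypothetical toy hypergraph: two proteins with masses {1, 2, 3} and {3, 4}.
toy_edges = [{1, 2, 3}, {3, 4}]
missing = incomplete_vertices({3, 4, 5}, toy_edges)
```

The set of incomplete vertices is exactly what the branching algorithm has to repair, either by removing such a vertex or by adding the rest of an edge that contains it.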

Pseudocode of the algorithm is given below:

Input: hypergraph H = (V, E), integers k, l (insertions, deletions), set of initial vertices M ⊆ V
Find all complete unions M′ ⊆ V with |M \ M′| ≤ k and |M′ \ M| ≤ l

begin
    M′ := M
    FindCompleteUnion(H, k, l, M, M′)
end

FindCompleteUnion(H = (V, E), k, l, M, M′)
begin
    if M = ∅ and 0 ≤ k, l then
        output M′
        terminate
    end if
    if k ≥ 0 and l ≥ 0 then
        choose arbitrary vertex v ∈ M
        B := ∅
        for all edges e ∈ E incident to v
            A′ := {vertices incident to e} \ M′
            FindCompleteUnion(H′ = (V, E), k, l − |A′|, M \ e, M′ ∪ A′)
            B := B ∪ {e}
        end for
        FindCompleteUnion(H′ = (V, E \ B), k − 1, l, M \ {v}, M′ \ {v})
    end if
end
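One possible Python rendering of this branching scheme is sketched below. It is an interpretation rather than the implementation used for the simulations: the names are hypothetical, the set `pending` holds the observed vertices that are not yet known to be complete, vertices are assumed to be comparable (e.g. numbers), and results are collected in a set because different branches may reach the same complete union.

```python
def find_complete_unions(edges, observed, k, l):
    """Enumerate complete unions M' with |observed - M'| <= k and |M' - observed| <= l."""
    edges = [frozenset(e) for e in edges]
    observed = frozenset(observed)
    results = set()

    def branch(active, k_left, l_left, pending, current):
        if k_left < 0 or l_left < 0:
            return                       # an error budget is exceeded
        if not pending:
            results.add(current)         # every vertex of current is complete
            return
        v = min(pending)                 # any pending vertex would do
        for e in [e for e in active if v in e]:
            added = e - current          # option 1: complete v through edge e
            branch(active, k_left, l_left - len(added), pending - e, current | e)
        # Option 2: treat v as noise; edges containing v can no longer be used.
        remaining = [e for e in active if v not in e]
        branch(remaining, k_left - 1, l_left, pending - {v}, current - {v})

    branch(edges, k, l, observed, observed)
    return results

# Hypothetical toy instance: three proteins, observed masses {1, 2, 4}.
toy_edges = [{1, 2, 3}, {3, 4}, {5, 6}]
unions = find_complete_unions(toy_edges, observed={1, 2, 4}, k=1, l=1)
```

With one insertion and one deletion allowed, the toy instance has two outcomes: mass 3 was lost (giving {1, 2, 3, 4}), or mass 3 was lost and mass 4 is noise (giving {1, 2, 3}).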

Note that we stop whenever we find any feasible solution, even if k and l are not exhausted. By doing so, we stick to the previously chosen approach: find only minimal solutions of the instance. But what would minimality mean in this case?

In the introduction we mentioned that we want to use only information coming from the empirical spectrum. This means that if $\exists e \in E$ such that $e \cap M = \emptyset$ and $e \subseteq M'$, i.e. if there is an edge in $M'$ that is not explained by the initial set of vertices $M$, then $M'$ is not a minimal solution. Note that a solution without such edges can be obtained from it by several removals of vertices; one can say that a non-minimal solution assumes more deletions than necessary.

Imagine the opposite situation, in which a solution assumes more insertions than it should. Keeping in mind that all solutions are complete unions, we can conclude that all vertices that are removed in the non-minimal solution were complete (we remove only vertices from $M$, and if a vertex $v \in M$ was not complete, then it should be either completed by other vertices or removed in every possible solution). Hence, we obtain another condition: none of the initially complete vertices should be removed.

Definition 1. A solution $M'$ of an instance $(H = (V, E), k, l, M)$ of UNION EDITING is not minimal if there exists another solution $M''$ such that ($M'' = M' \setminus e$, $e \in E$, $e \cap M = \emptyset$) or ($M'' = M' \cup e$, $e \in E$, $e \subseteq M$). All other solutions are minimal.

Theorem 1. FindCompleteUnion(H, k, l, M, M) outputs solutions of the instance (H, k, l, M) of UNION EDITING and lists all minimal solutions among them.
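Definition 1 can be turned into a direct check over an already enumerated list of solutions. The sketch below is illustrative; the function name and the toy instance are hypothetical, and `observed` plays the role of the initial vertex set M.

```python
def is_minimal_solution(solution, all_solutions, edges, observed):
    """Apply Definition 1 to `solution` against the other enumerated solutions."""
    solution, observed = frozenset(solution), frozenset(observed)
    others = {frozenset(s) for s in all_solutions} - {solution}
    for e in map(frozenset, edges):
        # The solution contains an edge with no support in the observed data.
        if not (e & observed) and (solution - e) in others:
            return False
        # The solution drops part of an edge that was fully observed.
        if e <= observed and (solution | e) in others:
            return False
    return True

# Hypothetical toy instance: only masses 1 and 2 were observed.
toy_edges = [{1, 2}, {3, 4}]
toy_solutions = [{1, 2}, {1, 2, 3, 4}]
small_ok = is_minimal_solution({1, 2}, toy_solutions, toy_edges, observed={1, 2})
large_ok = is_minimal_solution({1, 2, 3, 4}, toy_solutions, toy_edges, observed={1, 2})
```

The larger solution adds the edge {3, 4}, which has no support in the observed data, so it is rejected as non-minimal.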

We do not give any upper bounds on the running time. It can be shown that CLIQUE can be reduced to a special case of UNION EDITING; thus the problem is (not surprisingly) NP-hard. This means that it will be hard to get optimistic bounds for the general case. For our inputs, though, the algorithm runs fast enough that computational complexity (which is itself an interesting separate topic for this problem) is not a practical concern; we cover these results in the next section.

3 SIMULATIONS

The main question of the simulations was whether it is feasible to get a list of minimal solutions for a typical problem instance in reasonable time. We are also interested in the false positives and in the recall (here, the fraction of proteins from the initial protein set that are detected), as the practical motivation dictates that we maximize the latter and eliminate the former.

To answer these questions, we generated uniformly at random 10 initial protein sets of sizes 5, 10, 15 and 20. Then we obtained a theoretical mass spectrum from our database (see the "Data" subsection), which we modified by deleting 1 to 9 of the theoretical masses uniformly at random and then inserting 1 to 9 masses, also chosen uniformly at random. Here our simulation approach deviated significantly from the standard one (well described in (He et al., 2009)), where random numbers are inserted instead of masses from the database and masses are perturbed by noise instead of being deleted. The noise rate there is, of course, higher, reaching 50% of noise peaks. One can translate this procedure into ours by accepting masses within some sliding window: i.e. consider a mass $m$ observed if $\exists m' \in M_{\mathrm{all}} : |v(m) - v(m')| < d$, where $v(m)$ represents the actual mass value, $d$ is the sliding window parameter and $M_{\mathrm{all}}$ is the set of all known masses. This way we will also get the initial spectrum modified by a few insertions and deletions, but the distribution of these insertions and deletions will be different.
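The sliding-window translation can be sketched as follows. This is a simplified illustration, not the procedure of (He et al., 2009): the function name is hypothetical, masses are plain floats, and each measured value is matched only to its nearest known mass.

```python
from bisect import bisect_left

def match_to_database(measured, known_masses, d):
    """Return the known masses that have a measured value closer than d."""
    known = sorted(known_masses)
    matched = set()
    for x in measured:
        i = bisect_left(known, x)
        # Only the neighbours around the insertion point can be nearest.
        candidates = known[max(i - 1, 0):i + 1]
        best = min(candidates, key=lambda m: abs(m - x), default=None)
        if best is not None and abs(best - x) < d:
            matched.add(best)
    return matched

# Hypothetical values in Da; 1200.0 has no database mass within the window.
observed = match_to_database([500.2, 1200.0, 1499.7], [500.0, 1000.0, 1500.0], d=0.5)
```

Measured values without a database mass inside the window are simply dropped, while a measured noise value that happens to fall near a database mass turns into an insertion.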

We did not exclude the masses outside of the detectable range (usually around 500-5000 Da). Of course, this gave us more specificity; strictly, we should also have removed them from the database, thus making the theoretical spectra of proteins less distinguishable. However, this is an equipment problem rather than a computational one, because if we lose elements and thus decrease the cardinality of sets in an instance of either SET COVER or UNION EDITING, the instance cannot become harder, and we would still list all the possible solutions. So the undetectable masses were kept to increase the computational challenge of the problem.

As soon as the empirical sample is generated, we apply to it an implementation of the algorithm described in the previous section. Because error bounds are unknown in real cases, we used the maximal parameter and twice the maximal parameter as bounds. For example, if 3 insertions and 0 deletions were introduced, we gave 6 insertions and 6 deletions as input. We also paid attention to the case of few insertions (0 to 3), as it seems to describe the real situation best: noise can be filtered in many ways, including by knowledge of the protein origin. E.g., if we know that the mixture contained only yeast proteins, then masses corresponding to peptides from other organisms can be removed.

Although the initial plan was to formulate the outputs of UNION EDITING instances as instances of the SET COVER problem and solve them with the branching algorithm described in (Damaschke and Molokov, 2009), it later turned out that this is not the best idea. As the number of obtained solutions (possible theoretical spectra) for UNION EDITING instances with high error bounds is too big, we would waste too much time going through all of them. First we need to do some categorization of errors, as there might be omnipresent masses, etc., in the same way as with proteins. It is still an open question how exactly one can do this, but it seems rather time-consuming to find set covers for all possible theoretical spectra. According to the running times given in Tables 1 and 2, and excluding the time spent on sample preprocessing, each full run would take more than a day.

Instead, we solve SET COVER only for the initial protein sets. More precisely, as input we used the universe $M$ (the union of the mass spectra of the initial proteins) and the sets $\{M_p \cap M : p \in P_{\mathrm{all}}\}$ as covering sets. The parameter $c$ is not known and should be selected by the investigator. We used the value 20 for it.

We cover the results of both algorithms in the subsection "Performance".

3.1 Performance

For all drawn samples, the SET COVER instances turned out to be trivial: they were solved in a matter of seconds. The reason was that the proteins had few shared masses and there were many unique masses, i.e. ones that appear in only one protein among the chosen candidates. Results for one of the samples are given in Table 1.

The running times of the implementation of the algorithm from (Damaschke and Molokov, 2009) suggest that we can use it as a tool to solve SET COVER instances on a regular basis.

Table 1: Results of the implementation of the branching algorithm for SET COVER enumeration.

|P|   c    Time      Recall
20    20   1.934 s   85%
15    20   2.075 s   80%
10    20   0.982 s   80%
5     20   0.244 s   80%

In this table, |P| is the cardinality of the initial set of proteins (sample size), c is the parameter bounding the solution size, Time is the running time, and Recall is the fraction of proteins in the initial set that are detected. But which proteins are counted as detected? If we count as hits only those that are present in all solutions, then, as was said in the introduction, we can guarantee that every such protein was in the initial set of proteins. Therefore, the number of false positives obtained by our method will be 0 (it is not shown in the table). Recall, however, decreases: although the initial set of proteins is listed among the solutions, not all of its proteins are omnipresent.

The following table summarizes the results of the branching algorithm given in the previous section.

Table 2: Results of the implementation of the branching algorithm for UNION EDITING.

Sample   k   l   |P|   k_max   l_max   Time
1        0   9   5     9       9       0.172 s
1        0   9   10    9       9       19 s
1        0   9   15    9       9       1102 s
1        0   9   20    9       9       3572 s
1        0   4   15    8       8       1260 s
3        6   6   15    6       6       181 s
2        9   9   5     9       9       0.634 s
2        9   9   5     18      18      39 s

Here the "Sample" column contains sample ids, k and l are the numbers of actual insertions and deletions, whilst k_max and l_max are the bounds specified for them.

The first part of the table (rows 1 to 4) demonstrates the dependency between the running time and the number of initial proteins |P|. Although no insertions were made at all, the high bound on their number led to a high running time, which increases rapidly with growing sample size. The second part of the table (rows 5 and 6) indicates that it is not the actual number of errors that matters for the running time, but the error bounds we provide. E.g., for the 12 errors in sample 3 we allow only 12 (6 insertions and 6 deletions), compared to 18 (9 insertions and 9 deletions) in the previous part of the table, and the run for |P| = 15 takes 181 s instead of 1102 s. The third part of the table (rows 7 and 8) gives the most optimistic results. If |P| is sufficiently small, then even with high error bounds, reaching 36 (approximately 20% of the initial sample size), the running time stays within one minute. This means that our approach can be used directly for classical PMF, where a low number of proteins in the mixture is guaranteed by separation procedures.

3.2 Discussion

The results raise several questions that are not yet answered. The first and most important one is: is it possible to reduce the running time for big initial protein sets? It is clear that we will get some exponential function, but the base of the exponent could be improved by intelligent branching rules. We may consider using techniques similar to those used for solving the HITTING SET problem in (Fernau, 2006). We may

also save time by ﬁnding a way to browse through so-

lution space (list of possible solutions for high error

bounds reaches order of 10

) of UNION EDITING.

Another good question is how well the method behaves on real data. It is hard to assess the real numbers of insertions and deletions in regular experiments, but one could try solving the instance for known protein mixtures.

Finally, it might be worthwhile to check whether it is possible to reformulate the problem as a counting problem and thus try some approximation scheme for SCME. Although we cannot get a good polynomial-time approximation scheme (Lund and Yannakakis, 1994), some randomized approximation scheme might work in polynomial time.

3.3 Data

The protein database uses protein sequences from the 2005 version of Swiss-Prot, with human as the species of origin. Theoretical spectra were obtained by simulating trypsin digestion of proteins with no missed cleavages (which assumes that trypsin always cuts where it should and does not miss any cleavage sites). For the computations we used a regular machine with an Intel Core 2 Duo T5800 processor (2.00 GHz) and 3 GB of RAM available.
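In-silico digestion without missed cleavages can be sketched as below. The cleavage rule used here (cut after K or R unless the next residue is P) is the commonly used trypsin rule and is an assumption, as the exact rule applied for the database is not stated above. The function name and the example sequence are hypothetical, and the conversion of peptides to masses is left out.

```python
def tryptic_peptides(sequence):
    """Cleave after K or R unless the next residue is P; no missed cleavages."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])   # C-terminal remainder
    return peptides

# Hypothetical amino acid sequence in one-letter code.
fragments = tryptic_peptides("MKRPAGKLLR")
```

For the example sequence the R followed by P is not a cleavage site, so the digestion yields the three peptides MK, RPAGK and LLR.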

4 CONCLUSIONS

Simulations suggest that the method can be applied to the standard PMF setup with small protein mixtures to improve the results obtained by classical methods. There is also a prospect of application in shotgun proteomics, which considers bigger protein mixtures. This requires modifications of the algorithm for solving UNION EDITING that will reduce its running time. A more intelligent way of enumerating the outputs of the algorithm might also provide insight into a faster solution.

ACKNOWLEDGEMENTS

This work was partially supported by the Swedish Research Council (Vetenskapsrådet), grant no. 2007-6437, "Combinatorial inference algorithms: parameterization and clustering". The author would also like to express gratitude to the Swedish Institute for former support and funding, and to his supervisor, Peter Damaschke, for valuable input in research and studies.

REFERENCES

Aebersold, R. and Mann, M. (2003). Mass spectrometry-based proteomics. Nature, 422:198–207.

Clauser, K., Baker, P., and Burlingame, A. (1999). Role of accurate mass measurement (+/- 10 ppm) in protein identification strategies employing MS or MS/MS and database searching. Anal. Chem., 71(14):2871–2882.

Damaschke, P. and Molokov, L. (2009). The union of minimal hitting sets: combinatorial parameterized bounds and counting. J. Discrete Algorithms, 7:391–401.

Fernau, H. (2006). Parameterized algorithms for hitting set: The weighted case. In CIAC'06, volume 41, pages 332–343.

He, Z., Yang, C., Yang, C., Qi, R. Z., Tam, J., and Yu, W. (2009). Optimization-based peptide mass fingerprinting for protein mixture identification. In RECOMB'09, pages 16–30. LNCS 5541.

Jensen, O. N., Podtelejnikov, A. V., and Mann, M. (1997). Identification of the components of simple protein mixtures by high-accuracy peptide mass mapping and database searching. Anal. Chem., 69(23):4741–4750.

Lund, C. and Yannakakis, M. (1994). On the hardness of approximating minimization problems. Journal of the ACM, 41(5):960–981.

Nesvizhskii, A. I. and Aebersold, R. (2005). Interpretation of shotgun proteomic data: The protein inference problem. Mol. Cell. Proteomics, 4:1419–1440.

Perkins, D. N., Pappin, D. J. C., Creasy, D. M., and Cottrell, J. S. (1999). Probability-based protein identification by searching sequence databases using mass spectrometry data. Electrophoresis, 20:3551–3567.

Samuelsson, J., Dalevi, D., Levander, F., and Rognvaldsson, T. (2004). Modular, scriptable and automated analysis tools for high-throughput peptide mass fingerprinting. Bioinformatics, 20(18):3628–3635.

Zhang, W. and Chait, B. T. (2000). ProFound: an expert system for protein identification using mass spectrometric peptide mapping information. Anal. Chem., 72(11):2482–2489.
