Later, in order to improve the AM, the analysis is
extended to the individual elemental contributions,
that is, to estimating the errors that may be caused
by C, H, N, O and S.
2 EXPERIMENT AND METHOD
Comprehensive simulation analysis can, in most cases,
help reveal the essential patterns hidden in
complicated data sets. To find the regular patterns in
the mass errors produced when the averagine model is
applied to all human proteins, an in-house program was
developed using the MATLAB toolbox, which provides
multiple functions and bioinformatics tools that can
handle large amounts of protein data and can
conveniently convert large numerical results into
visual diagrams such as scatter plots and bar charts.
The averagine model (AM) forms the basis of the
experiment and, in this case, also provides the basic
idea of how to estimate the formulas of unknown large
proteins.
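As a hedged illustration of this idea, the averagine scaling is commonly written as below. The residue composition and its average mass of 111.1254 Da are the values published by Senko et al. (1995). The symbols n_e^est, a_e and M_avg are notation introduced here, and the exact rounding or hydrogen-adjustment rule used by the in-house program is not stated in the text, so plain rounding is an assumption.

```latex
\[
n_e^{\mathrm{est}} = \mathrm{round}\!\left( a_e \cdot \frac{M_{\mathrm{avg}}}{111.1254\ \mathrm{Da}} \right),
\qquad e \in \{\mathrm{C}, \mathrm{H}, \mathrm{N}, \mathrm{O}, \mathrm{S}\},
\]
\[
(a_{\mathrm{C}},\ a_{\mathrm{H}},\ a_{\mathrm{N}},\ a_{\mathrm{O}},\ a_{\mathrm{S}})
= (4.9384,\ 7.7583,\ 1.3577,\ 1.4773,\ 0.0417).
\]
```

Here M_avg is the average mass of the protein and n_e^est is the estimated number of atoms of element e.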
Here, the molecular information for each protein in
the human protein database was used, and the
estimated masses were then compared with the
theoretical masses calculated from the formula
provided by the database. Both the average mass
errors and the monoisotopic mass errors were obtained
across the different mass ranges.
All the statistical calculations presented here are
based on the human protein database, a collection of
20,341 protein sequences (June 2019). The primary task
of our study is to obtain the mass error distribution
over the full mass range, which will provide the
experimental foundation for improving the AM by
reducing its estimation errors when applied to large
proteins with MW larger than 30 kDa.
2.1 Main Analysis Process
To obtain the estimated mass errors, four computational
steps are carried out as follows (Figure 1):
Step 1: Compute the molecular formula of every protein
in the human protein database;
Step 2: Use the formulas obtained in the first step
and the emass algorithm to compute the theoretical
isotopic distributions;
Step 3: Use the AM and the average mass obtained in
the second step to estimate the formula of each
protein;
Step 4: Generate two types of mass errors, i.e.,
average mass errors and monoisotopic mass errors.
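The four steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the in-house MATLAB program: the function names are hypothetical, the emass isotopic distribution of Step 2 is replaced by masses computed directly from the formula, the rounding residual is absorbed into the hydrogen count (one common convention, assumed here), and disulfide bonds and modifications are ignored.

```python
# Atom counts (C, H, N, O, S) of each amino-acid residue, i.e. the free
# amino acid minus one water molecule.
RESIDUES = {
    "G": (2, 3, 1, 1, 0), "A": (3, 5, 1, 1, 0), "S": (3, 5, 1, 2, 0),
    "P": (5, 7, 1, 1, 0), "V": (5, 9, 1, 1, 0), "T": (4, 7, 1, 2, 0),
    "C": (3, 5, 1, 1, 1), "L": (6, 11, 1, 1, 0), "I": (6, 11, 1, 1, 0),
    "N": (4, 6, 2, 2, 0), "D": (4, 5, 1, 3, 0), "Q": (5, 8, 2, 2, 0),
    "K": (6, 12, 2, 1, 0), "E": (5, 7, 1, 3, 0), "M": (5, 9, 1, 1, 1),
    "H": (6, 7, 3, 1, 0), "F": (9, 9, 1, 1, 0), "R": (6, 12, 4, 1, 0),
    "Y": (9, 9, 1, 2, 0), "W": (11, 10, 2, 1, 0),
}

# Atomic masses in Da, in the order C, H, N, O, S.
MONO = (12.0, 1.00782503, 14.0030740, 15.9949146, 31.9720707)
AVG = (12.0107, 1.00794, 14.0067, 15.9994, 32.065)

# Averagine residue composition and its average mass (Senko et al., 1995).
AVERAGINE = (4.9384, 7.7583, 1.3577, 1.4773, 0.0417)
AVERAGINE_MASS = 111.1254


def sequence_formula(sequence):
    """Step 1: molecular formula (C, H, N, O, S) of an unmodified chain."""
    counts = [0, 2, 0, 1, 0]  # one water accounts for the two termini
    for residue in sequence:
        for i, n in enumerate(RESIDUES[residue]):
            counts[i] += n
    return tuple(counts)


def mass(formula, table):
    """Stand-in for Step 2: mass taken directly from the formula."""
    return sum(n * m for n, m in zip(formula, table))


def averagine_formula(average_mass):
    """Step 3: estimate a formula from the average mass alone."""
    units = average_mass / AVERAGINE_MASS
    counts = [round(a * units) for a in AVERAGINE]
    # Assumed convention: correct the rounding residual with hydrogens.
    residual = average_mass - mass(counts, AVG)
    counts[1] += round(residual / AVG[1])
    return tuple(counts)


def mass_errors(sequence):
    """Step 4: (average, monoisotopic) error, estimated minus theoretical."""
    true = sequence_formula(sequence)
    est = averagine_formula(mass(true, AVG))
    return (mass(est, AVG) - mass(true, AVG),
            mass(est, MONO) - mass(true, MONO))


# Hypothetical test sequence containing each of the 20 residues once.
avg_err, mono_err = mass_errors("ACDEFGHIKLMNPQRSTVWY")
```

Because the estimate is fitted to the average mass, the average mass error stays within about half a hydrogen mass in this sketch, while the monoisotopic error depends on how far the true composition departs from averagine.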
Figure 1: Diagram of the four computing steps.
Figure 2: Key process of AM application.
Although average mass is widely used for large
molecule mass estimation, the monoisotopic mass
still represents the most accurate mass for a
compound.
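The two mass definitions can be written out explicitly. The symbols are notation assumed here: n_e is the number of atoms of element e in the formula, m_e^mono is the mass of its lightest isotope (which is also the most abundant one for C, H, N, O and S), and m_e^avg is its abundance-weighted atomic mass.

```latex
\[
M_{\mathrm{mono}} = \sum_{e} n_e \, m_e^{\mathrm{mono}},
\qquad
M_{\mathrm{avg}} = \sum_{e} n_e \, m_e^{\mathrm{avg}},
\qquad e \in \{\mathrm{C}, \mathrm{H}, \mathrm{N}, \mathrm{O}, \mathrm{S}\}.
\]
```

For carbon, for example, the monoisotopic mass is exactly 12 Da while the average atomic mass is about 12.011 Da, so the gap between the two masses grows roughly in proportion to the number of atoms.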
In this experiment, two sets of errors, namely the
monoisotopic mass errors and the average mass errors,
are computed through the four computing steps
introduced above (Figure 2).
Both errors are taken into consideration because the
former could offer hints on how to improve the AM,
while the latter offers information relevant to
unknown large molecules, validated against the
information from the database.
2.2 Simulation of the Estimated Mass Errors for All Proteins from the Human Database
We statistically computed two types of mass errors
between the averagine-fit and theoretical isotopic
clusters. Based on the resulting distribution, we then
compared the mass error ranges obtained for the
average masses and for the monoisotopic masses. The
results showed that the mass accuracy can be improved
remarkably for large proteins in terms of the
monoisotopic mass errors.
However, this is not sufficient for high-resolution
mass spectrometers; therefore, a further analysis of
the elemental contributions is provided to estimate the
mass errors arising from each individual element,
namely C, H, N, O, and S.
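One plausible way to split a mass error into elemental contributions is sketched below. It assumes the error is the difference between the masses of the estimated and theoretical formulas, so that each element contributes its atom-count difference times its atomic mass. The function name and the two compositions are hypothetical and serve only as an illustration; they are not taken from the database or from the authors' program.

```python
# Monoisotopic atomic masses in Da.
MONO = {"C": 12.0, "H": 1.00782503, "N": 14.0030740,
        "O": 15.9949146, "S": 31.9720707}


def elemental_contributions(estimated, theoretical, masses=MONO):
    """Return each element's share of the mass error, in Da.

    The share of element e is (n_est - n_true) * m_e, so the shares
    sum to the total error between the two formulas.
    """
    return {el: (estimated[el] - theoretical[el]) * m
            for el, m in masses.items()}


# Hypothetical compositions, for illustration only.
theoretical = {"C": 1320, "H": 2100, "N": 370, "O": 400, "S": 10}
estimated = {"C": 1333, "H": 2096, "N": 367, "O": 399, "S": 11}

parts = elemental_contributions(estimated, theoretical)
total_error = sum(parts.values())
```

Large shares of opposite sign can cancel in the total, which is why inspecting the elements one by one can reveal more than the overall error alone.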
More detailed results are shown in the next
section.
3 RESULT AND CONCLUSION
As stated previously, the estimated average mass