
 
well as environmental factors. Therefore a subset of 
genes is likely to have a similar or nearly identical 
pattern of gene expression if probed under similar 
conditions. Thus, when global gene expression data 
in the form of EST expression levels are compared 
between similarly prepared EST libraries (e.g. non-
normalised preparations) from the same tissue, 
the Pearson correlation between such libraries is 
likely to be close to "+1" for many genes. 
Previously, ~1,500 transcripts were 
identified as tissue-specific in investigations using 
CGAP’s cDNA DGED (Milnthorpe and Soloviev, 
2012). This selection was optimised further by summing 
all the libraries from each tissue into a single 
super-library per tissue. Pearson correlations were then 
calculated between all possible pairs of super-libraries (equation 1). 
$$\mathrm{Correl}(X,Y) = \frac{\sum (x - m)(y - n)}{\sqrt{\sum (x - m)^{2} \, \sum (y - n)^{2}}} \tag{1}$$
Where  x and y are the total EST counts for the 
transcript concerned in super-libraries X and Y,  m 
and n are the mean EST counts across all transcripts 
in super-libraries X and Y, and Correl(X,Y)  is the 
calculated Pearson Correlation Coefficient between 
the two super-libraries X and Y. 
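
Equation 1 can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the code used in the study: the function name `pearson` and the raising of an error for a constant profile are assumptions made here, and the variable names follow the symbols defined above.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two super-libraries (equation 1).

    xs, ys: EST counts per transcript, listed in the same transcript order.
    """
    if len(xs) != len(ys) or not xs:
        raise ValueError("both super-libraries need the same, non-empty transcript list")
    m = sum(xs) / len(xs)  # mean EST count in super-library X
    n = sum(ys) / len(ys)  # mean EST count in super-library Y
    numerator = sum((x - m) * (y - n) for x, y in zip(xs, ys))
    denominator = sqrt(sum((x - m) ** 2 for x in xs) * sum((y - n) ** 2 for y in ys))
    if denominator == 0:
        # Assumption: a constant profile has no defined correlation.
        raise ValueError("correlation is undefined for a constant profile")
    return numerator / denominator


# Hypothetical counts for four transcripts in two super-libraries.
r = pearson([12, 0, 1, 3], [0, 9, 2, 3])
print(f"Correl(X,Y) = {r:.3f}")
```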
A higher correlation value between two super-libraries 
indicates higher inter-tissue correlation, which is 
undesirable for tissue-specific markers; ideally all 
inter-tissue correlations should be equal to "0". Hence a 
sum of squares value was calculated from the correlations. 
$$S = \sum_{\text{all pairs}} \mathrm{Correl}^{2} \tag{2}$$
Where  Correl is the calculated Pearson Correlation 
coefficient between two super-libraries and S is the 
sum of squares value for the correlations between all 
possible pairs of super-libraries. 
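
As a rough illustration of this score, the sketch below assumes that S is the plain sum of squared Pearson coefficients over every unordered pair of super-libraries. The function names, the tissue labels and the counts are hypothetical, and treating a constant profile as having zero correlation is an assumption of this sketch.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length count profiles."""
    m = sum(xs) / len(xs)
    n = sum(ys) / len(ys)
    num = sum((x - m) * (y - n) for x, y in zip(xs, ys))
    den = sqrt(sum((x - m) ** 2 for x in xs) * sum((y - n) ** 2 for y in ys))
    # Assumption: a constant profile is treated as uncorrelated.
    return num / den if den else 0.0


def sum_of_squares(super_libs):
    """Sum of squared correlations over all pairs of super-libraries.

    super_libs: dict mapping tissue name -> list of EST counts,
    with every list in the same transcript order.
    """
    return sum(
        pearson(super_libs[a], super_libs[b]) ** 2
        for a, b in combinations(super_libs, 2)
    )


# Hypothetical super-libraries with four transcripts each.
toy = {
    "tissue_A": [12, 0, 1, 3],
    "tissue_B": [0, 9, 2, 3],
    "tissue_C": [1, 1, 15, 3],
}
print(f"S = {sum_of_squares(toy):.3f}")
```

Squaring means that strongly negative correlations are penalised as much as strongly positive ones, so S is smallest when every pair of super-libraries is close to uncorrelated.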
To optimise the initial selection and decrease the 
overall inter-tissue correlations, individual genes 
were then removed from the super-libraries one at a time 
and the sum of squares value was recalculated after each 
removal. The gene whose removal resulted in the lowest 
overall inter-tissue correlations was permanently removed 
and the iteration was repeated. The decrease in 
inter-tissue correlations slowed shortly before the 
1,000th gene was removed. The remaining genes included 
high-quality tissue-specific markers and were 
retained. These were optimised further to improve 
intra-tissue correlation between libraries from the 
same tissue using the original libraries (data not 
shown). This produced an EST expression matrix 
containing 244 genes. We have earlier reported a 
few applications of the matrix for elucidation of 
tissue identity (Milnthorpe and Soloviev, 2012). 
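
The iterative removal step can be read as a greedy backward elimination. The sketch below is one possible reading under stated assumptions, not the authors' implementation: it removes a fixed number of genes (`n_remove`) rather than stopping when the decrease slows, it assumes S is the sum of squared pairwise correlations, and all names and counts are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; a constant profile is treated as uncorrelated (assumption)."""
    m = sum(xs) / len(xs)
    n = sum(ys) / len(ys)
    num = sum((x - m) * (y - n) for x, y in zip(xs, ys))
    den = sqrt(sum((x - m) ** 2 for x in xs) * sum((y - n) ** 2 for y in ys))
    return num / den if den else 0.0


def sum_of_squares(libs, genes):
    """Sum of squared correlations over all tissue pairs, using only `genes`."""
    return sum(
        pearson([libs[a][g] for g in genes], [libs[b][g] for g in genes]) ** 2
        for a, b in combinations(libs, 2)
    )


def greedy_remove(libs, genes, n_remove):
    """Repeatedly drop the gene whose removal gives the lowest sum of squares.

    libs: dict tissue -> dict gene -> EST count in that tissue's super-library.
    Returns (kept_genes, removed_genes_in_order).
    """
    kept = list(genes)
    removed = []
    for _ in range(n_remove):
        best = min(
            kept,
            key=lambda g: sum_of_squares(libs, [h for h in kept if h != g]),
        )
        kept.remove(best)
        removed.append(best)
    return kept, removed


# Hypothetical data: "hk" is expressed highly in both tissues.
libs = {
    "X": {"g1": 5, "g2": 0, "g3": 0, "hk": 100},
    "Y": {"g1": 0, "g2": 5, "g3": 0, "hk": 100},
}
kept, removed = greedy_remove(libs, ["g1", "g2", "g3", "hk"], n_remove=1)
print(kept, removed)
```

Each round re-evaluates every remaining gene, so the cost grows roughly with the square of the number of genes multiplied by the number of tissue pairs; a practical version would also need a stopping rule, such as the slowing decrease described above.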
To investigate the robustness of our 
quality control approach based on the developed 
matrix, we used modelled data to simulate 
small expression datasets. These were generated 
from the original expression data by proportionally reducing 
the reported EST counts and rounding any fractional 
values to the nearest whole count each time. This 
continued until each library ceased to present any 
ESTs mapping onto the 244 marker transcripts or 
ceased to be identified as a positive match for the 
tissue from which it was created. Using this 
approach we scaled down the expression datasets and 
compared all of the model libraries with the original 
libraries by calculating the correlation values for the 
genes in our matrix. Virtually every library 
continued to correlate well with its tissue of origin 
until the very last EST mapping onto one of the 
marker transcripts was removed (a typical outcome is shown 
in Figure 1 for pancreas). Furthermore, the majority 
of the scaled-down libraries remained identifiable until 
total EST counts fell below 10–50, which is comparable to 
the size of some of the smallest libraries in CGAP’s database.  
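
The scaling-down procedure can be sketched as follows. Several details are assumptions made for illustration only: the original counts are multiplied by successively halved factors, ties are rounded half up, and the series stops when no ESTs remain (the second stopping criterion, loss of a positive tissue match, is not modelled). The function names, transcript labels and counts are hypothetical.

```python
def scale_down(counts, factor):
    """Scale EST counts by `factor` and round to the nearest whole count.

    counts: dict transcript -> non-negative EST count.
    Ties are rounded half up (an assumption; the rule for ties is not stated).
    """
    return {gene: int(c * factor + 0.5) for gene, c in counts.items()}


def model_series(counts, step=0.5):
    """Build progressively smaller model libraries from one original library.

    Each model library is derived from the ORIGINAL counts, so rounding
    errors do not accumulate. Stops once no ESTs remain.
    """
    if not 0 < step < 1:
        raise ValueError("step must lie strictly between 0 and 1")
    series = []
    factor = step
    while True:
        scaled = scale_down(counts, factor)
        if sum(scaled.values()) == 0:
            break
        series.append((factor, scaled))
        factor *= step
    return series


# Hypothetical library with counts for three marker transcripts.
original = {"m1": 40, "m2": 8, "m3": 1}
for factor, lib in model_series(original):
    print(f"factor={factor:<10} total ESTs={sum(lib.values()):>3} {lib}")
```

Each model library produced this way could then be correlated against the original libraries over the marker transcripts to check whether it is still assigned to its tissue of origin.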
Our results for pancreas are summarised in Table 
1, which details results for each of the original 
libraries used and the model data sets. The initial and 
final numbers of total ESTs are shown and the 
correlation values are indicated for each pair. 
Remarkably, the final counts across all transcripts in 
each library which still yield positive intra-tissue 
correlation are below 100 ESTs for all but 3 libraries 
tested and are below 10 ESTs for 15 out of 33 
libraries tested. The tissue typing quality does not 
change dramatically. These findings show that the 
EST expression matrix can be used to confirm the 
identity of virtually any library, including small 
libraries, making it a very robust method for the 
quality control of expression libraries. Similar 
results were obtained for all other tissues tested so 
far: lung, placenta, retina and testis (data not shown).
  
3 DISCUSSION 
We created an EST expression matrix based on 
carefully selected marker genes and demonstrated its 
potential for quality control of EST data, 
elucidation of the tissue identity of uncharacterised 
libraries, and cancer staging. The model libraries 
described here were analysed using the matrix. The 
findings presented in Figure 1 and Table 1 and the 
results for the other tissues show that the EST matrix 
can be used to identify the tissue of origin for 
libraries containing as few as 2 ESTs. These findings 
show that tissue-specific gene expression can be 
used as a quality control method, which substantially  