Authors:
Erola Pairo (1); Santiago Marco (1) and Alexandre Perera (2)
Affiliations:
(1) Universitat de Barcelona, Spain; (2) CIBER de Bioingeniería, Materiales y Nanomedicina (CIBER-BBN), Spain
Keyword(s):
Transcription factors, Binding sites, Numerical DNA, Principal components analysis, Missing values, BPCA.
Related Ontology Subjects/Areas/Topics:
Algorithms and Software Tools; Bioinformatics; Biomedical Engineering; Biostatistics and Stochastic Models; Genomics and Proteomics; Sequence Analysis
Abstract:
Transcription factor binding sites are short, degenerate sequences, located mostly in gene promoters, where regulatory proteins bind to control transcription. Locating these sequences is an important problem, and many experimental and computational methods have been developed for it. Algorithms that search for binding sites are usually based on Position Specific Scoring Matrices (PSSM), which treat each position independently. Here, symbolic DNA is mapped to numerical sequences, and a detector is built from a Principal Component Analysis (PCA) of those sequences, which takes covariances between positions into account. When a treatment of missing values is incorporated, the Q-residuals detector, based on PCA, performs better than a PSSM algorithm. The performance of the detector depends on how the missing values are estimated and on the percentage of missing values considered in the model.
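
As an illustration of the Q-residuals idea described in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes a one-hot numerical encoding of DNA (the abstract does not say which mapping is used), a toy set of aligned sites invented for the example, and plain PCA via SVD. The missing-value treatment (for instance BPCA) is not shown. The names encode, fit_pca and q_residual are hypothetical.

```python
import numpy as np

BASES = "ACGT"  # assumed one-hot alphabet; the paper's actual mapping may differ


def encode(seq):
    """Map a DNA string to a flat one-hot vector of length 4 * len(seq)."""
    vec = np.zeros((len(seq), 4))
    for i, base in enumerate(seq):
        vec[i, BASES.index(base)] = 1.0
    return vec.ravel()


def fit_pca(X, n_components):
    """Return the column mean and the loading matrix P (features x components)."""
    mean = X.mean(axis=0)
    # Rows of vt are the principal directions of the centered data.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components].T


def q_residual(x, mean, P):
    """Squared norm of the part of x that the PCA model does not explain."""
    centered = x - mean
    residual = centered - P @ (P.T @ centered)
    return float(residual @ residual)


# Toy aligned binding sites, invented for illustration only.
sites = ["TATAAA", "TATATA", "TATAAT", "TATATT"]
X = np.array([encode(s) for s in sites])
mean, P = fit_pca(X, n_components=2)

q_site = q_residual(encode("TATAAA"), mean, P)        # sequence resembling the model
q_background = q_residual(encode("GCGCGC"), mean, P)  # sequence unlike the model
print(q_site, q_background)
```

A low Q-residual means the candidate is well described by the PCA model of known sites, so a detector of this kind would flag candidates whose Q falls below a chosen threshold. Because the principal components are computed from the covariance of the encoded positions, dependencies between positions enter the score, which is the contrast with a PSSM drawn in the abstract.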