The Possibilities of Filtering Pairs of SNPs in GWAS Studies

Exploratory Study on Public Protein-interaction and Pathway Data

Matej Lexa and Stanislav Stefanic

Faculty of Informatics, Masaryk University, Botanická 68a, 60200 Brno, Czech Republic

Keywords:

GWAS, SNPs, Biological Knowledge, Databases, Genotyping, Filtering

Abstract:

Genome-wide association studies have become a standard way of discovering novel causative alleles by looking for statistically significant associations in patient genotyping data. The present challenge for these methods

is to discover associations involving multiple interacting loci, a common phenomenon in diseases often re-

lated to epistasis. The main problem is the exponential increase in necessary computational power for every

additional interacting locus considered in association tests. Several approaches have been proposed to manage

this problem, including limiting analysis to interacting pairs and ﬁltering SNPs according to external biolog-

ical knowledge. Here we explore the possibilities of using public protein interaction data and pathway maps

to select only those pairs of SNPs that are likely to interact, perhaps because of epistatic mechanisms working

at the protein level. After ﬁltering all possible pairs of SNPs by their presence in common protein-protein

interactions or proteins sharing a metabolic or signalling pathway, we calculate the possible reduction in com-

putational requirements under different scenarios. We discuss these exploratory results in the context of the

so-called "lost heritability" and the usefulness of this approach for similar scenarios.

1 INTRODUCTION

Genome-wide association studies (GWAS) have be-

come a standard way of analysing genotyping data to

discover associations between single nucleotide poly-

morphisms or similar variants and phenotype, often

representing a diagnosis or disease status or progression (Witte, 2010). The common GWAS workflow includes organizing genotyping data into an n × (m + 1) matrix with one row for each of the n individuals, one column for each of the m SNPs, and one additional column holding the known phenotype. The

data is then analyzed for statistically signiﬁcant as-

sociations between the phenotype and SNP columns.

Commonly, a χ² test with multiple testing correction is used to discover informative SNPs (Mantel and Haenszel, 1959; Huh et al., 2011). To date, 1605

GWAS studies have been deposited in GWAS Central

at http://www.gwascentral.org, reporting P-values for

almost 3 million SNP markers for the studied pheno-

types (Thorisson et al., 2009). A total of 11751 risk

SNPs have been reported from these studies (P-values below 5 × 10⁻⁸) in 1738 publications, as reported by

the NHGRI GWAS Catalog (Hindorff et al., 2009).

There is an ongoing debate among geneticists and

other scientists about "lost heritability". Since only a small part of phenotypic variation is explained by single SNPs discovered using GWAS, people have been looking for the lost heritability (Maher, 2008), partly for intellectual reasons and partly because it is thought

to go hand-in-hand with disease risk (Manolio et al.,

2009). One school of thought argues that it is to be

discovered in interactions between loci or SNPs (van

Steen, 2011). Most of these interactions can be de-

scribed by the well-known genetic mechanism called

epistasis.

Several approaches have been proposed to deal

with epistasis and interacting SNPs. This includes

limiting the analysis to potentially interacting pairs

that can be predicted from simpler calculations, such

as detecting single, interaction-free SNP-phenotype

associations ﬁrst (Emily et al., 2009) or limiting the

analysis to local chromosomal regions (Slavin et al.,

2011). Even though multiple loci can form an interaction network, most of its properties are probably already present in pairwise interactions (Liu et al., 2012; Hua et al., 2012). Another approach proposes

to ﬁlter the analysed SNP combinations only for those

that (based on our biological knowledge) have a high

enough chance to interact through epistasis (Bush

et al., 2009). Such external (to the study) biologi-

cal knowledge can regard gene regulation and regu-

latory networks, metabolic and signalling pathways,


Lexa M. and Stefanic S..

The Possibilities of Filtering Pairs of SNPs in GWAS Studies - Exploratory Study on Public Protein-interaction and Pathway Data.

DOI: 10.5220/0004915002590264

In Proceedings of the International Conference on Bioinformatics Models, Methods and Algorithms (BIOINFORMATICS-2014), pages 259-264

ISBN: 978-989-758-012-3

 2014 SCITEPRESS (Science and Technology Publications, Lda.)

protein-protein interactions, temporal or spatial co-

expression of genes, common functional categories,

such as those deﬁned by Gene Ontology, etc. (Bush

et al., 2009).

Here we explore further the possibilities of using

public protein interaction data and pathway maps to

select only those pairs of SNPs that can interact because

of epistatic or other unknown mechanisms working at

the protein level. After ﬁltering all possible pairs of

SNPs by their presence in common protein-protein in-

teraction or proteins sharing a metabolic or signalling

pathway, we calculate the reduction in computational

requirements under different scenarios. Apart from

ﬁltering by biological knowledge, an approach used

by other authors, we suggest the use of graph decon-

volution techniques, as another way to further narrow

the set of possible epistatic pairs in the data to the

most likely causative variants (Feizi et al., 2013).

2 SOFTWARE AND METHODS

2.1 Collection of Data Representing

Biological Knowledge

Biological knowledge is represented in this paper by

a set of protein-centric databases commonly used in

molecular biology to obtain information on protein-protein interactions (DIP (Salwinski et al., 2004),

MINT (Licata et al., 2012)), metabolic and signalling

pathways (Wikipathways (Kelder et al., 2012)) and

biological and molecular function (Gene Ontology).

Data were downloaded in bulk text format and incor-

porated into the analysis as needed and described be-

low.

2.1.1 Protein-protein Interactions

Three datasets were used as a source of protein-protein interaction data: DIP, the Database of Interacting Proteins (Salwinski et al., 2004), and two datasets from MINT, the Molecular INTeraction database (Licata et al., 2012). We downloaded human data in tab25 format from DIP (tab35/Hsapi20130707.txt, July 7, 2013) and human binary data and complexes in mitab format (2013-03-26-mint-human-binary.mitab26.txt, 2013-03-26-mint-human-complexes.mitab26.txt). We used AWK scripts to select the relevant columns and to convert each row of the data into an SQL insert statement for populating our working database. In this manner we created the tables diphuman, mint_binary and mint_complexes. Further operations on these data are described in Section 2.3.

2.1.2 Metabolic and Signalling Pathways

Data for the presence of proteins and their in-

teraction in common metabolic and signalling

pathways was obtained from Wikipathways at

wikipathways.org (Kelder et al., 2012). The

human pathway data is available in the ﬁle

wikipathways_data_Homo_sapiens.tab. Similarly to the interaction data, the file was processed with AWK scripts to generate appropriate SQL commands for populating our database with pathway membership data. After downloading the Wikipathways file, 13 columns were used... The next operations are described in Section 2.3.

2.2 Mapping SNP IDs to Protein IDs

To allow selection of SNP pairs (or general k-tuples

for k > 2) based on protein biological knowledge

we only considered SNPs located within coding se-

quences (this could be expanded to include poten-

tial regulatory sequences such as promoters or regula-

tory elements in introns or known trans-regulatory el-

ements further away from the respective gene). These

SNPs were then assigned to proteins coded by the se-

quence they reside in. This gives us a direct mapping

between SNP IDs (such as rs2251969, rs952094,

rs75931146, rs78394850), RefSeq gene IDs (such as NM_003126) and protein IDs (such as Uniprot P02549, HGNC SPTA1, RefSeq protein NP_003117 or DIP 1020N).

Speciﬁcally, we found RefSeq gene IDs for each

protein occurring in interactions or pathways and created a table that mapped each protein ID to a RefSeq gene ID. We used the UCSC Genome Table Browser to download SNP IDs and RefSeq coding sequence IDs in BED format. We used BEDTools (Quinlan and Hall, 2010) and its region intersection and merging capabilities to obtain clean mapping data and included it in our MySQL database.
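To make the intersection step concrete, the following sketch shows the kind of SNP-to-coding-region overlap that a BED interval intersection computes. It is a simplified stand-in written in Python, not the BEDTools invocation used in the study; all coordinates, gene labels and SNP labels are hypothetical. Intervals follow the BED convention (0-based start, half-open end).

```python
def map_snps_to_genes(snps, cds_regions):
    """Return the set of (snp_id, gene_id) pairs where the SNP lies in a region.

    snps:        iterable of (chrom, pos, snp_id), pos is 0-based
    cds_regions: iterable of (chrom, start, end, gene_id), half-open [start, end)
    """
    regions_by_chrom = {}
    for chrom, start, end, gene_id in cds_regions:
        regions_by_chrom.setdefault(chrom, []).append((start, end, gene_id))

    mapping = set()  # a set removes duplicates from overlapping transcripts
    for chrom, pos, snp_id in snps:
        for start, end, gene_id in regions_by_chrom.get(chrom, []):
            if start <= pos < end:
                mapping.add((snp_id, gene_id))
    return mapping

# Hypothetical data: two overlapping transcripts of GENE_A and one of GENE_B.
cds = [("chr1", 100, 200, "GENE_A"),
       ("chr1", 150, 250, "GENE_A"),
       ("chr2", 10, 20, "GENE_B")]
snps = [("chr1", 160, "snp_1"), ("chr1", 300, "snp_2"), ("chr2", 10, "snp_3")]
snp_to_gene = map_snps_to_genes(snps, cds)
```

The linear scan per SNP is adequate only for small inputs; for genome-scale data a sorted sweep or an interval tree, as implemented in dedicated tools, would be needed.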

2.3 Detecting and Counting Biologically

Relevant SNP Pairs

We used MySQL operations to organize all interaction data into a single table (human_interaction), where all three types of interactions were included. The resulting table consists of five columns: ID, our internal unambiguous ID of the interaction; Protein_A, the first element participating in the interaction; Protein_B, the second interacting element; int_db, the name of the one of the three downloaded databases in which the interaction is described; and id_in_db, the native

BIOINFORMATICS2014-InternationalConferenceonBioinformaticsModels,MethodsandAlgorithms

260

ID of the interaction in the original database named in the previous column. This cross-reference is kept for possible future use and was not used in this analysis. Because we used three different sources of interaction data, each with its own type of protein ID, we had to create an unambiguous ID for every protein and use this ID in the Protein_A and Protein_B columns of the final human_interaction table. An auxiliary table of all proteins used in the study (human_protein) was created, in which we assigned a unique ID to each protein occurring in interactions in any of the three tables. Duplicates occurring because of multiple RefSeq transcripts covering the same genomic region were eliminated using the UNION SQL operation in conjunction with unique(). Finally, we used the human_protein table to merge the tables diphuman, mint_binary and mint_complexes into the final table (human_interaction), in which all interactions are preserved and duplicates are eliminated.
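The merging logic of this section can be illustrated with a small in-memory sketch. It is written in Python instead of SQL and is not the authors' implementation; the function name, the dictionary keys, and the sample native IDs are hypothetical. The sketch assumes that an interaction is undirected, so (A, B) and (B, A) count as the same record.

```python
def merge_interactions(sources, to_internal):
    """Merge interaction lists from several databases into one deduplicated list.

    sources:     dict db_name -> list of (native_id, protein_a, protein_b)
    to_internal: dict mapping any source-specific protein ID to an internal ID
    """
    seen = set()
    merged = []
    for db_name, rows in sources.items():
        for native_id, a, b in rows:
            if a not in to_internal or b not in to_internal:
                continue  # protein could not be mapped to an internal ID
            # Sort so that (A, B) and (B, A) are treated as one interaction.
            pair = tuple(sorted((to_internal[a], to_internal[b])))
            if pair in seen:
                continue
            seen.add(pair)
            merged.append({"id": len(merged) + 1,
                           "protein_a": pair[0], "protein_b": pair[1],
                           "int_db": db_name, "id_in_db": native_id})
    return merged

# Hypothetical example: the same interaction appears in two databases.
sources = {"dip": [("dip_rec_1", "dipA", "dipB")],
           "mint": [("mint_rec_7", "P2", "P1"), ("mint_rec_8", "P1", "P3")]}
to_internal = {"dipA": 1, "dipB": 2, "P1": 1, "P2": 2, "P3": 3}
merged = merge_interactions(sources, to_internal)
```

With this design the first database that reports an interaction is the one recorded in `int_db`; whether that matches the cross-reference policy of the original table is not stated in the text.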

Using the data in this table, we created a table named snp_interaction, which contains all interacting SNP pairs that can be formed from SNPs mapped to two interacting proteins. This was accomplished with the table snp2hgnc, which contains the mappings between SNPs and the genes in which they occur. Because the snp2hgnc table contained RefSeq gene IDs, we had to add RefSeq IDs to the human_protein table using a web-based identifier translation service from EBI. These mappings (SNP to RefSeq ID) were then recalculated into mappings from SNPs to each protein occurring in the human_interaction table. The snp_interaction table contains the following attributes: ID, SNP_A and SNP_B (both in the form of dbSNP rs* IDs). This table therefore contains all potentially interacting SNP pairs (based on the relevant biological knowledge) and can be counted or read as needed. In this paper we report some of the counts and other relevant numbers useful in estimating the complexity of GWAS after using the pairs for filtering of SNPs or SNP pairs.
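The construction of candidate SNP pairs from interacting proteins amounts to a cross product of the SNP sets of the two partners. A minimal Python sketch follows; the function and variable names and the sample identifiers are hypothetical, and the real table was built with SQL joins rather than with this code.

```python
from itertools import product

def snp_pairs(interactions, snps_by_protein):
    """Return the set of unordered SNP pairs implied by protein interactions.

    interactions:    iterable of (protein_a, protein_b)
    snps_by_protein: dict protein -> iterable of SNP IDs located in its gene
    """
    pairs = set()
    for a, b in interactions:
        for s1, s2 in product(snps_by_protein.get(a, ()),
                              snps_by_protein.get(b, ())):
            if s1 != s2:
                # Store each pair in a canonical order to avoid duplicates.
                pairs.add((s1, s2) if s1 < s2 else (s2, s1))
    return pairs

# Hypothetical example: protein 1 carries two SNPs, protein 2 carries one.
candidate_pairs = snp_pairs([(1, 2)], {1: ["rs1", "rs2"], 2: ["rs3"]})
```

An interaction between proteins carrying p and q SNPs contributes up to p × q pairs, which is why the SNP density per gene reported in Table 1 strongly affects the size of the filtered set.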

The overall relationships in this kind of data are illustrated in Figure 1, showing the source of biological

knowledge and how it allows us to focus on a subset

of available SNP pairs.

2.4 Evaluation Procedures

To arrive at the main result in this study, the propor-

tion of SNP pairs that can be ﬁltered out by consider-

ing biological knowledge, we calculated the number

of SNP pairs that can be created from the dataset as

snp × (snp − 1)/2. We also counted the number of

unique SNP pairs that fall onto proteins involved in

protein-protein interaction or that are members of a

common pathway. The percentage of the latter against

the former gave us a numerical value for the reduction

as reported in Table 1.
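The calculation can be reproduced with a few lines of Python. The sketch below uses the Common/PPI row of Table 1 as input; note that the 73.8 million pairs in that row correspond to the 12149 SNPs located in interacting genes, and that the filtered count is the rounded table value (0.42 million), so the resulting percentage only approximates the 0.56% reported. The function names are ours.

```python
def all_pairs(n_snps):
    """Number of unordered pairs that can be formed from n_snps SNPs."""
    return n_snps * (n_snps - 1) // 2

def percent_remaining(n_filtered_pairs, n_snps):
    """Share (in %) of all possible pairs that remain after filtering."""
    return 100.0 * n_filtered_pairs / all_pairs(n_snps)

# Common SNPs, PPI filter (values taken from Table 1).
total = all_pairs(12149)                  # about 73.8 million pairs
pct = percent_remaining(420_000, 12149)   # rounded filtered count from Table 1
```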

3 RESULTS

We collected information on two different kinds of in-

teractions between proteins in biological systems (di-

rect physical interaction and participation in a com-

mon pathway). In the context of genome-wide associ-

ation studies (GWAS) considering SNP pairs with ge-

netic or statistical interaction, we calculated the possi-

ble computational savings in stepping down from all

possible SNP pairs to only those that are supported

by some kind of biological knowledge. Only protein-

protein interaction (PPI) and pathway membership

were considered.

SNPs were evaluated in two scenarios, one using

all known human reference SNPs present in the dbSNP 138 database (232,952,851 altogether)

(Sherry et al., 2001), while the other only evaluated

common SNPs (as deﬁned by the relevant UCSC Ta-

ble Browser Repeat and Variation table)(Karolchik

et al., 2004). 62,676,337 common SNPs are available

in the dbSNP 138 database (minor allele frequency of

common SNPs is > 1%). After selecting only those

SNPs that mapped to a RefSeq coding sequence in the

human genome, and removing duplicates, we counted

97,332 common SNPs and 1,590,290 general SNPs in

genes (Table 1).

3.1 Search Space Reduction After

Incorporating Biological Knowledge

3.1.1 Protein-protein Interactions

Protein-protein interactions provide many possibili-

ties for epistatic effects. Protein complexes may de-

pend on residue interactions that can sometimes ac-

cept compensatory mutations. The increased expres-

sion of one protein in a protein complex can lead to

various signals leading to increased expression of its

partners.

We counted 9419 interactions among 3033 pro-

teins in the protein-protein interaction dataset and

901659 interactions among 6513 proteins in the pathway dataset. Each gene (coding sequences only) was covered on average by 5.24 SNPs in the common set and 65 SNPs in the full set (Table 1). Using PPI data from DIP and MINT and

common SNPs, we were able to reduce the number

of SNP pairs to be analysed in a GWAS study to only

ThePossibilitiesofFilteringPairsofSNPsinGWASStudies-ExploratoryStudyonPublicProtein-interactionand

PathwayData

261

[Figure 1 diagram. Labels: DIP, MINT; Wikipathways; Interacting proteins; Pathway members; SNPs; Candidate SNP interaction (PPI-derived); Candidate SNP interaction (pathway-derived); SNP pairs to be filtered; Metabolic or signalling pathway.]

Figure 1: Relationship between different entities and types of data considered in this study. The red and green lines show the small number of informative candidates for SNP interaction after filtration by biological knowledge from external sources (top). Grey lines represent pairs of SNPs that will not be analyzed in a downstream GWAS analysis. To reduce clutter some grey lines were intentionally omitted.

0.56% of the maximal possible number of pairs. A similar reduction after filtering (to 0.39%; Table 1) was obtained when considering all known SNPs.

3.1.2 Metabolic and Signalling Pathways

Pathways provide a similar type of knowledge to protein-protein interactions, but tend to form larger networks of genes/proteins. 6513 unique genes were

mapped to pathways. Their grouping was such as to form 901659 pairwise interactions. For the purpose

of this study, any pair of proteins participating in the same metabolic pathway was deemed to potentially interact. In a more realistic scenario, we could consider only pairs of proteins that directly share a metabolite or otherwise interact in the pathway.
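The pathway criterion used here (any two members of the same pathway form a candidate pair) can be sketched as follows. The Python code is an illustration under that stated rule; the pathway and protein labels are hypothetical and the function name is ours.

```python
from itertools import combinations

def pathway_protein_pairs(members_by_pathway):
    """Return all unordered protein pairs that share at least one pathway.

    members_by_pathway: dict pathway_id -> iterable of protein IDs
    """
    pairs = set()
    for members in members_by_pathway.values():
        # sorted(set(...)) removes repeated members and fixes the pair order
        for a, b in combinations(sorted(set(members)), 2):
            pairs.add((a, b))
    return pairs

# Hypothetical example: two pathways sharing proteins B and C.
pathways = {"pathway_1": ["A", "B", "C"], "pathway_2": ["B", "C", "D"]}
protein_pairs = pathway_protein_pairs(pathways)
```

A pathway with n members contributes n(n − 1)/2 protein pairs, so a few large pathways can dominate the total. This is consistent with the much larger number of pathway-derived interactions compared with the PPI-derived ones.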

BIOINFORMATICS2014-InternationalConferenceonBioinformaticsModels,MethodsandAlgorithms

262

Table 1: Numerical results of counting the processed interaction and pathway data at various stages and from different aspects. Two sets of SNPs from dbSNP, "SNP Common" (present in at least 1% of the population) and "SNP All" were used in the study. The last three columns show the filtration effect in terms of % remaining SNP pairs after the procedure. Fields marked with asterisk (*) were prohibitively expensive to calculate and were only estimated from SNP Common data.

SNP set  Biol. knowledge  Ref gene  SNPs in set  SNP/gene  Genes in int.  SNPs in int.  SNP/int. gene  SNP pairs [mil]  Filtered [mil]  % reduction
Common   PPI              18565     97332        5.24      3033           12149         4.005          73.8             0.42            0.56
Common   Path             18565     97332        5.24      6513           16622         2.55           138.1            7.30            5.28
All      PPI              24502     1590290      65        3033           261349        86.2           34152            133.7           0.39
All      Path             24502     1590290      65        6513           357572        54.9           63929            2344.4          3.67

Because pathway maps are larger than the PPI network (6513/901659 vs. 3033/9419 in terms of the number of genes/gene interactions), filtration using this criterion is bound to be less effective and to produce a higher number of potentially interacting SNPs. Using pathway membership data from Wikipathways, we were able to reduce the number of SNP pairs to be analysed in a GWAS study to only 5.28% of the maximal possible number of pairs.

3.2 Incorporating Detected Pairs into

GWAS Workﬂows

While the database format was convenient for study

purposes, and while the resulting database can be eas-

ily queried for SNPs that are candidates for interac-

tion in GWAS studies, routine use of such calcula-

tions would probably beneﬁt from a custom-coded so-

lution, with dedicated data structures created to store

marker pairs or triples, perhaps as a library that could

be linked to a GWAS analysis program.
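One possible shape for such a dedicated data structure is sketched below in Python. It is only an illustration of the idea; the class name `PairFilter` and its methods are hypothetical and do not correspond to any existing library. Storing each tuple as a frozenset makes lookups independent of SNP order and lets the same container hold triples as well as pairs.

```python
class PairFilter:
    """In-memory set of candidate SNP tuples with order-independent lookup."""

    def __init__(self, tuples=()):
        self._tuples = {frozenset(t) for t in tuples}

    def add(self, *snps):
        """Register a candidate pair (or triple, etc.) of SNP IDs."""
        self._tuples.add(frozenset(snps))

    def __contains__(self, snps):
        return frozenset(snps) in self._tuples

    def __len__(self):
        return len(self._tuples)

    def candidates(self, genotyped_snps):
        """Yield stored tuples whose SNPs were all genotyped in the study."""
        available = set(genotyped_snps)
        for t in self._tuples:
            if t <= available:
                yield tuple(sorted(t))

# Hypothetical usage inside a GWAS pair-testing loop.
pair_filter = PairFilter([("rs1", "rs2")])
pair_filter.add("rs2", "rs3", "rs4")            # a triple
testable = list(pair_filter.candidates(["rs1", "rs2", "rs9"]))
```

For the hundreds of millions of pairs reported in Table 1, a compact integer encoding or an on-disk index would likely be needed instead of Python sets.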

4 DISCUSSION

We have shown that using biological knowledge from

commonly accessible biological databases can help to

identify a small subset of all possible SNP pairs, thus

reducing the computational requirements of a GWAS

analysis aiming to study marker interactions and their

association with some phenotype. While identify-

ing interacting or otherwise related proteins in pairs

helped to reduce the number of pairs to evaluate to

0.4-5% of their original unﬁltered number, the effect

would be even more pronounced in case of triples

or quadruples. Filtration by biological knowledge is

deﬁnitely a viable option to prioritize SNPs prior to

analysis, as opposed to other methods prioritizing after

analysis.
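The growth of the unfiltered search space with the tuple size k can be made explicit with the binomial coefficient. The short Python sketch below uses the 12149 common SNPs in PPI genes from Table 1 as an example input; it shows only the size of the exhaustive search space and makes no claim about how many triples or quadruples would survive filtering.

```python
from math import comb

n = 12149  # common SNPs located in interacting genes (Table 1)

# Number of distinct k-tuples of SNPs for k = 2, 3, 4.
counts = {k: comb(n, k) for k in (2, 3, 4)}

# Each step from k to k + 1 multiplies the count by (n - k) / (k + 1).
growth_2_to_3 = counts[3] / counts[2]
```

For this input each additional locus multiplies the number of tests by a factor in the thousands, which is the reason exhaustive testing beyond pairs quickly becomes impractical.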

It should be noted that we have not made any pro-

visions for separating SNPs that represent synony-

mous and non-synonymous mutations. Such analysis

or selection could lower the number of relevant SNPs,

resulting in further reduction in number of pairs that

must be evaluated.

It is now commonly accepted that epistasis should be behind a significant portion of the so-called "lost heritability". Many recent works, including this paper, concern methods of detecting multiple interacting SNPs in whole-genome studies and processing them in an efficient manner. Because of the computational complexity of evaluating k-tuples of SNPs for k >> 1, it would be desirable to work with k as small as possible, but still be able to discover effects of larger networks of interacting SNPs. Liu et al. suggest such networks could be reconstructed from detected SNP-SNP interactions (Liu et al., 2012). However, the pairwise interactions could be plagued by "phantom" interactions that arise when indirect relationships are detected because of the transitivity of interactions. Recently, a solution for separating direct and indirect interactions in networks occurring in other disciplines has been proposed (Feizi et al., 2013). We suggest that SNP interaction networks be reconstructed from pairwise data, as carried out by Liu et al. (2012), only after the pairwise data have been network-deconvoluted, resulting in higher quality SNP networks showing only direct interactions as edges.
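For orientation, the core identity behind network deconvolution, as we understand the formulation of Feizi et al. (2013), can be written as below. The symbols $G_{\mathrm{obs}}$ (observed pairwise association matrix) and $G_{\mathrm{dir}}$ (matrix of direct dependencies) are our notation, and the derivation assumes that the series converges, which requires suitable scaling of the observed matrix. Applying this to SNP-SNP association matrices is the suggestion made here, not a tested procedure.

```latex
% Observed associations are modelled as the sum of direct effects
% and all indirect effects transmitted along paths of length 2, 3, ...
\begin{align}
G_{\mathrm{obs}} &= G_{\mathrm{dir}} + G_{\mathrm{dir}}^{2} + G_{\mathrm{dir}}^{3} + \dots
                  = G_{\mathrm{dir}}\,(I - G_{\mathrm{dir}})^{-1}, \\
% Solving for the direct component:
G_{\mathrm{dir}} &= G_{\mathrm{obs}}\,(I + G_{\mathrm{obs}})^{-1}.
\end{align}
```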

ThePossibilitiesofFilteringPairsofSNPsinGWASStudies-ExploratoryStudyonPublicProtein-interactionand

PathwayData

263

5 CONCLUSIONS

In this study we explore ways to select appropriate candidate SNP-SNP pairs for GWAS studies (for analyzing interacting SNPs), based on biological knowledge. We also calculate the reduction in computational complexity that can be obtained after such a pre-filtering step. As can be seen from the contrasting examples of direct PPI and pathway membership data, the reduction achieved by filtering is smaller for pathway data, with their wider pathway membership, than for the more restrictive pairwise interactions. The difference in this specific example is 10-fold. The filtering would be even more selective in the case of SNP triples or quadruples. This computational exercise is discussed in the context of the problem of so-called "lost heritability" and the need to analyze possible interactions between SNPs and their association with certain phenotypes in GWAS analysis.

ACKNOWLEDGEMENTS

Financial support was provided by the EU 7th Framework Project "THALAssaemia MOdular Stratification System for Personalized Therapy THALAMOSS" (FP7-HEALTH-2012-INNOVATION-1 Collaborative Project; http://thalamoss.eu/index.html).

REFERENCES

Bush, W., Dudek, S., and Ritchie, M. (2009). Biofilter: a knowledge-integration system for the multi-locus analysis of genome-wide association studies. In Pacific Symposium on Biocomputing, volume 14, pages 368–379.

Emily, M., Mailund, T., Hein, J., Schauser, L., and Schierup, M. (2009). Using biological networks to search for interacting loci in genome-wide association studies. Eur J Hum Genet, 17:1231–1240.

Feizi, S., Marbach, D., Médard, M., and Kellis, M. (2013). Network deconvolution as a general method to distinguish direct dependencies in networks. Nature Biotechnology, 31:726–733.

Hindorff, L., Sethupathy, P., Junkins, H., Ramos, E., Mehta, J., Collins, F., and Manolio, T. (2009). Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc Natl Acad Sci USA, (May 27).

Hua, L., Lin, H., Li, D., Li, L., and Liu, Z. (2012). Mining functional gene modules linked with rheumatoid arthritis using a SNP-SNP network. Genomics, Proteomics & Bioinformatics, 10(1):23–34.

Huh, I.-S., Oh, S., and Park, T. (2011). A chi-square test for detecting multiple joint genetic variants in genome-wide association studies. In IEEE International Conference on Bioinformatics and Biomedicine Workshop, pages 708–713.

Karolchik, D., Hinrichs, A., Furey, T., Roskin, K., Sugnet, C., Haussler, D., and Kent, W. (2004). The UCSC Table Browser data retrieval tool. Nucleic Acids Res., 32:D493–D496.

Kelder, T., Van Iersel, M., Hanspers, K., Kutmon, M., Conklin, B., Evelo, C., and Pico, A. (2012). Wikipathways: building research communities on biological pathways. Nucleic Acids Res., 40:D1301–D1307.

Licata, L., Briganti, L., Peluso, D., Perfetto, L., Iannuccelli, M., Galeota, E., Sacco, F., Palma, A., Nardozza, A., Santonico, E., Castagnoli, L., and Cesareni, G. (2012). MINT, the molecular interaction database: 2012 update. Nucleic Acids Res., 40:D857–D861.

Liu, Y., Zhou, J., Liu, Z., Chen, L., and Ng, M. (2012). Construction and analysis of genome-wide SNP networks. In IEEE 6th International Conference on Systems Biology (ISB).

Maher, B. (2008). Personal genomes: The case of the missing heritability. Nature, 456:18–21.

Manolio, T., Collins, F., Cox, N., et al. (2009). Finding the missing heritability of complex diseases. Nature, 461(7265):747–753.

Mantel, N. and Haenszel, W. (1959). Statistical aspects of the analysis of data from retrospective studies of disease. J. Natl. Cancer Inst., 22:719–748.

Quinlan, A. and Hall, I. (2010). BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics, 26(6):841–842.

Salwinski, L., Miller, C., Smith, A., Pettit, F., Bowie, J., and Eisenberg, D. (2004). The Database of Interacting Proteins: 2004 update. Nucleic Acids Research, 32(90001):D449–D451.

Sherry, S., Ward, M., Kholodov, M., Baker, J., Phan, L., Smigielski, E., and Sirotkin, K. (2001). dbSNP: the NCBI database of genetic variation. Nucleic Acids Res., 29(1):308–311.

Slavin, T., Feng, T., Schnell, A., Zhu, X., and Elston, R. (2011). Two-marker association tests yield new disease associations for coronary artery disease and hypertension. Hum Genet, 130:725–733.

Thorisson, G., Lancaster, O., Free, R., Hastings, R., Sarmah, P., Dash, D., Brahmachari, S., and Brookes, A. (2009). HGVbaseG2P: a central genetic association database. Nucleic Acids Research, 37:D797–802.

van Steen, K. (2011). Traveling the world of gene-gene interactions. Briefings in Bioinformatics, 13(1):1–19.

Witte, J. (2010). Genome-wide association studies and beyond. Annu. Rev. Public Health, 31:9–20.

BIOINFORMATICS2014-InternationalConferenceonBioinformaticsModels,MethodsandAlgorithms

264