A Coding Theoretical Approach to Predict Sequence Changes in

H5N1 Influenza A Virus Hemagglutinin

Keiko Sato, Toshihide Hara and Masanori Ohya

Department of Information Science, Tokyo University of Science, Noda, Japan

Keywords: Receptor Binding Domain, H5N1, Code Structure, Sequence Prediction.

Abstract: The changes in the receptor binding domain of influenza A virus hemagglutinin lead to the appearance of

new viral strains that evade the immune system. To prepare the future emergence of potentially dangerous

outbreaks caused by divergent influenza strains including human-adapted H5N1 strains, it is imperative that

we understand the rule stored in the sequence of the receptor binding domain. Information of life is stored as

a sequence of nucleotides, and the sequence composed of four nucleotides seems to be a code. It is

important to determine the code structure of the sequences. Once we know the code structure, we can make

use of mathematical results concerning coding theory for research in life science. In this study, we applied

various codes in coding theory to sequence analysis of the 220 loop in the receptor binding domain of H1,

H3, H5 and H7 subtype viruses isolated from humans. Sequence diversity in the 220 loop has been observed

even within the same hemagglutinin subtype. However, we found that the code structure of the 220 loop

from the same subtype remains unchanged. Our results indicate that the sequences at the 220 loop have the

structure of subtype-specific codes. In addition, in view of these finding, we predicted possible amino acid

changes in the 220 loop of H5N1 strains that will emerge in the future. Our method will facilitate

understanding of the evolutionary patterns of influenza A viruses, and further help the development of new

antiviral drugs and vaccines.

1 INTRODUCTION

Influenza A viruses have eight pieces of segmented

RNA, which encode 11 proteins (Olsen et al., 2006).

The antigenic properties in the two viral surface

proteins, hemagglutinin and neuraminidase, are used

to classify influenza A viruses into different

subtypes. Currently Influenza A viruses circulating

among humans are the H1N1 and H3N2 subtypes.

Although other subtypes such as H5N1 and H7N9

have not yet gained the ability to spread efficiently

from person to person, these virus subtypes have

occasionally infected humans.

High-pathogenicity avian H5N1 influenza

viruses exhibiting high lethality continue to pose

threats to our lives since their emergence in China in

1996. According to the World Health Organization

(WHO), there have been 826 human cases with

H5N1 influenza infection since 2003, and

approximately 53% of the cases have died (as of

March 31, 2015). Despite the high mortality, H5N1

viruses have not yet gained the ability to spread

efficiently from person to person. However, the

outbreaks of H5N1 have been reported among

domestic poultry and wild birds in many countries

(Durand et al., 2015; Pfeiffer et al., 2011;

Yamamoto et al., 2011). In addition, recent studies

reported that a reassortant influenza virus containing

a hemagglutinin protein from an H5N1 virus with

four mutations can be transmitted between ferrets

(Imai et al., 2012). The viral surface protein,

hemagglutinin mediates binding of the virus to target

cells via the host cell receptor, sialic acid (Jiang et

al., 2012; Rumschlag-Booms and Rong, 2013). The

hemagglutinin of avian influenza viruses

preferentially binds sialic acid receptors (α2,3-SA)

on epithelial cells in the intestinal tract of birds and

in the lower respiratory tract of humans, whereas the

hemagglutinin of human influenza viruses

preferentially binds another type of sialic acid (α2,6-

SA) (Schrauwen and Fouchier, 2014; Yen and

Peiris, 2012). The receptor binding domain (RBD)

of hemagglutinin, situated at the outer surface on top

of the viral spike, is composed of three major

structural elements: a 130-loop (residues 134-138), a

190-helix (residues 188-190), and a 220-loop

Sato K., Hara T. and Ohya M.

A Coding Theoretical Approach to Predict Sequence Changes in H5N1 Inﬂuenza A Virus Hemagglutinin.

DOI: 10.5220/0005659701590167

In Proceedings of the 9th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOINFORMATICS 2016), pages 159-167

ISBN: 978-989-758-170-0

159

(residues 221-228) based on H3 numbering (Das et

al., 2009; Durand et al., 2015; Jiang et al., 2012;

Stevens et al., 2006). It is considered that the

mutations in the RBD could affect the receptor

binding avidity and specificity of hemagglutinin

(Chen et al., 2011; de Vries et al., 2013; de Vries et

al., 2014; Schrauwen and Fouchier, 2014). The RBD

is the primary target of neutralizing antibodies,

which are induced by virus infection or by

vaccination with specific antigen (Bright et al.,

2003; Chen et al., 2011; Jiang et al., 2012; Khurana

et al., 2011; McCullough et al. 2012). However, the

mutations in the RBD lead to change in viral

immunogenicity and antigenicity (Chen et al., 2011;

Xu et al., 2010). Jiang et al. (2012) state that RBD

plays a critical role in the elucidation of antiviral

immune response and protective immunity.

McCullough et al. (2012) also state that a better

understanding of mutations in the RBD may be

useful in vaccine and drug design effort. To prepare

the future emergence of potentially dangerous

outbreaks caused by divergent influenza strains

including human-adapted H5N1 strains, it is

imperative that we understand the rule stored in the

sequence of the RBD.

Information of life is stored as a code composed

of four nucleotides: adenine (A), cytosine (C),

guanine (G), and thymine (T). Therefore, we can

consider that the DNA or gene in each organism is a

code showing its inherent structure. In protein

coding region, each group of three consecutive

nucleotides is called a codon, and each codon

corresponds to one amino acid. The total number of

three nucleotide groups is the third power of 4,

which means we have 64 codons. However, only 20

proteinogenic amino acids exist in nature. Moreover,

it is supposed that the third nucleotide for a codon

will not play an essential role in making of an amino

acid. This shows that a gene has redundancy to

correct errors to some extent. In other words, it has a

structure that is similar to one of an error-

correcting/detecting code for the transmission of

information. In life-science research, it is important

to determine the code structure of the target gene.

Once we know the code structure, we can make use

of mathematical results concerning coding theory for

research in life science. How can the RBD

sequences of influenza A viruses be discussed using

coding theory? The present study was conducted to

find out the code structure of the 220 loop of

influenza A viruses, and to predict sequence changes

in the 220 loop of H5N1 virus.

2 METHODS

2.1 Sequence Data

We applied artificial codes in coding theory to

sequence analysis of the 220 loop in the H1, H3, H5

and H7 RBD. All full-length amino acid and

nucleotide sequences of hemaggulutinin from

influenza A H1, H3, H5, and H7 subtypes were

downloaded from the Influenza Research Database

on September 2014. The hemaggulutinin data set

consists of 8,941 human sequences from the H1

subtype between 1918 and 2014, 6,013 human

sequences from H3 subtype between 1968 and 2014,

230 human sequences from the H5 subtype between

1997 and 2013, and 51 human sequences from H7

subtype between 1996 and 2014. The sequences

were aligned using MAFFT (Katoh and Toh, 2008)

which can quickly process a large dataset.

2.2 Sequence Analysis of the 220 Loop

by Coding Theory

We explain how to encode the nucleotide sequence

of the 220 loop to detect the code structure. The

method for applying artificial codes to sequence

analysis has been described in detail previously

(Ohya and Sato, 2000; Sato et al., 2013)

. Since the

Galois Field GF(4) consists of four elements, 0, 1, 

and 



such that 



++1=0 , the four

nucleotides can be expressed in

each of four elements.

There are a total of 24 (= 4!) different possible

combinations to map the four nucleotides to the four

elements in GF(4).

First, an important part of the nucleotide

sequence of the 220 loop from an influenza strain,

namely the nucleotide sequence excluding the third

nucleotide of each codon, is transformed into the

information sequence which consists of the elements

of GF(4). Next, the information sequence is grouped

into blocks and then encoded into code words of an

error-correcting/detecting code C. The total length of

such a code (code word length) is multiples of 3 and

the length of the information symbols (information

block length) is multiples of 2. The check symbols

in each code word are placed into the corresponding

position of the third nucleotide of codon. Then, the

encoded sequence, which consists of the set of the

code words, is written back to nucleotide sequence.

We call it the encoded nucleotide sequence. After

that, the encoded nucleotide sequence is converted

into amino acid sequence. We call it the encoded

amino acid sequence. Finally, the degree of

similarity between the amino acid sequence of the

BIOINFORMATICS 2016 - 7th International Conference on Bioinformatics Models, Methods and Algorithms

160

220 loop from the influenza strain and the encoded

amino acid sequence described above is computed.

We think that if the amino acid sequence of the 220

loop is identical to the encoded amino acid sequence

generated by the code C, i.e. the similarity is 100%,

then the nucleotide sequence of the 220 loop has the

structure of the code C. Therefore, it is possible to

find the code structure of the 220 loop by computing

the degree of similarity for various artificial codes.

Artificial codes used for our study are the so-called

linear codes, cyclic codes, Bose-Chaudhuri-

Hocquenghem (BCH) codes, self-orthogonal codes

and Iwadare codes. Practically, we used 95 types of

codes including differences in generator polynomial.

Let 



( = 1, 2, ⋯ , 230) be 230 amino acid

sequences of the 220 loop from the H5 subtype. As

described above, we encode the 230 nucleotide

sequences of the 220 loop in a code C, and then get

the encoded amino acid sequences 





( =

1, 2, ⋯ , 230). Because the 220 loop is composed of

8 amino acid residues, a degree to measure the

similarity between 



and 





is denoted by rate of

coincidence (RC) as follows:

RC(



,





) = 1 −  8

(

0≤RC(



,





)≤1

)⁄

where  is the numbers of sites for which two amino

acid sequences differ from each other. RC(



,





)=1

means that the similarity between 



and 





is 100%.

If all of the 230 amino acid sequences of the 220

loop from H5 subtype are identical to the encoded

amino acid sequences generated by the code C, i.e.

∑

RC(



,





)





230

⁄

=1, then 100% of the 220

loop nucleotide sequences have the structure of the

code C.

(A) (24, 16) cyclic code with



(



)

=



+



+



+







+



+



+



+



+



(B) (12, 8) cyclic code with 

(



)

=



+

subtype subtype

D) Self-orthogonal code of R=2/3 with

G_3^1 (D)=D^2+1,G_3^2 (D)=D+1

subtype subtype

Figure 1: Percentage of the 220 loop nucleotide sequences with the structure of the indicated codes for H1, H3, H5, and H7

subtypes. The percentage was calculated to the second decimal place.  is the generator polynomial of each code. For

figures (A) and (B), the correspondence between the four nucleotides and the elements in GF(4) was given as A → 0, C

→1, T →  and G → 



. For figures (C) and (D), C → 0, A →1, G →  and T → 



A Coding Theoretical Approach to Predict Sequence Changes in H5N1 Inﬂuenza A Virus Hemagglutinin

161

By using 95 types of codes for each case of the

24 representations of the four nucleotides in the

elements of GF(4), we tried to find the code

structure of the 220 loop in each of the H1, H3, H5

and H7 subtype viruses in this way.

Once we found the code structure for the 200

loop of influenza A virus by using various artificial

codes, we can apply this results to the prediction of

amino acid residues in the 220 loop of influenza

strains that will emerge in the future. The 220 loop is

composed of 8 amino acid residues (24 nucleotides).

The 220 loop of the H5N1 viruses isolated since

1997 showed nucleotide changes in 5 positions (the

first at codons 221, 222 and 223, and the second at

codons 226 and 227) out of 16 positions excluding

the third nucleotide position from each of the 8

codons. Therefore we consider the 5 positions as

variable positions, while the remaining 11 positions

as no variable positions. Given the possibility of any

one of four nucleotides at each of the 5 positions, the

information sequence is composed of 16 nucleotides

as follows, where N stands for any one of four

nucleotides: NCNANTAAGGCNANGG. In other

words, as for the information sequence of length 16,

1,024 (= 4

× 1

) patterns are made through

combination of these 16 positions. To predict

possible amino acid changes in the 220 loop of

H5N1 influenza hemagglutinin, each of these

information sequences was encoded using the

encoding scheme of the code characterizing the 220

loop sequences from H5N1 viruses.

3 RESULTS

3.1 The Code Structure of the 220

Loop of Influenza A Viruses

Figure 1 shows the percentage of the 220 loop

nucleotide sequences with the structure of the

indicated codes for their respective subtypes.

Interestingly, more than 95% (8,504/8,941) of the

220 loop nucleotide sequences of the H1 subtype

that infected humans between 1918 and 2014 had

the structure of the (24, 16) cyclic code with the

generator polynomial 

(



)

=



+



+











+



+



+



+



+



(Figure 1(A)).

Almost all of the 220 loop nucleotide sequences

from other subtypes (H3, H5 and H7) did not have

that structure. For the H3 subtype that infected

humans between 1968 and 2014, more than 99%

(5,956/6,013) of the 220 loop nucleotide sequences

had the structure of the (12, 8) cyclic code with the

generator polynomial 

(



)

=



+1 (Figure 1(B)).

Table 1: Possible amino acid changes in the 220 loop of

H5N1 influenza A strains that will emerge in the future.

Residues 221-228



TEMNGQNG SEVKGLNG PKLNGQNG

TEMNGQSG SEVKGLTG PKLNGQSG

TEMNGRTG SEVNGRNG PKLKGLIG

TEMKGPIG SEVNGRSG PKLKGLTG

TQINGQNG SEVKGPSG PKLNGRNG

TQINGQSG SEVKGPTG PKLNGRSG

TQIKGLIG SELNGQNG PKLKGPIG

TQIKGLTG SELNGQSG PKLKGPTG

TQIKGPNG SELKGLTG PKVNGQNG

TQIKGPIG SELNGRNG PKVNGQSG

TQIKGPTG SELNGRSG PKVKGLIG

TQVKGLIG SELKGPTG PKVKGLTG

TQVKGLTG SQINGHIG PKVNGRNG

TQVKGPIG AKINGQNG PKVNGRSG

TQVKGPTG AKINGQSG PKVKGPIG

TQLKGLIG AEMNGHIG PKVKGPTG

TQLKGLTG AEMNGHTG PEMNGQNG

TQLKGPIG AEMKGLNG PEMNGHIG

TQLKGPTG AEMKGLIG PEMNGQSG

SKINGQNG AEMKGLSG PEMNGHTG

SKINGQSG AEMNGRIG PEMKGLNG

SKIKGLTG AEMNGRTG PEMKGLSG

SKINGRNG AEMKGPIG PEMNGRNG

SKINGRSG AEMKGPTG PEMNGRSG

SKIKGPTG AQVKGLIG PEMKGPNG

SKVNGQNG AQVKGLTG PEMKGPSG

SKVNGQSG AQVKGPIG PEVNGQNG

SKVKGLSG AQVKGPTG PEVNGQSG

SKVKGLTG AQLKGLIG PEVKGLIG

SKVNGRNG AQLKGLTG PEVKGLTG

SKVNGRSG AQLKGPIG PEVNGRNG

SKVKGPTG AQLKGPTG PEVNGRSG

SKLNGQNG PKINGQNG PEVKGPIG

SKLNGQSG PKINGHIG PEVKGPTG

SKLKGLTG PKINGQSG PELNGQNG

SKLNGRNG PKIKGLNG PELNGQSG

SKLNGRSG PKIKGLIG PELKGLIG

SKLKGPTG PKIKGLSG PELKGLTG

SEMNGQNG PKIKGLTG PELNGRNG

SEMNGQSG PKINGRNG PELNGRSG

SEMKGPNG PKINGRSG PELKGPIG

SEVNGQNG PKIKGPIG PELKGPTG

SEVNGQSG PKIKGPTG

Those from other subtypes (H1, H5 and H7) did

not have that structure. In addition, we found the

BIOINFORMATICS 2016 - 7th International Conference on Bioinformatics Models, Methods and Algorithms

162

code structure characterizing the 220 loop sequences

from the H5 and H7 subtypes, respectively. All

(230/230) of the nucleotide sequences of the H5

subtype that infected humans between 1997 and

2013 had the structure of the (24, 16) cyclic code

with the generator polynomial 

(



)

=



(Figure 1(C)). For the H7 subtype that infected

humans between 1996 and 2014, approximately 98%

(50/51) of the nucleotide sequences had the structure

of the self-orthogonal code of information rate

R=2/3 with the generator polynomial 





(



)

=



1, 





(



)

=+1 (Figure 1(D)). The amino acid

sequences of the 220 loop are diverse even within

the same subtype (Tables S1-S4). However,

surprisingly, the code structure of the 220 loop from

the same subtype remains unchanged.

3.2 Future Sequence Changes in H5N1

220 Loop

We found the mutation rules for the 200 loop of

influenza A virus hemagglutinin by using various

artificial codes in information transmission. As

became clear above, the 220loop human sequences

from H5N1 strains have preserved the structure of a

specific code since the emergence of H5N1 in

humans in 1997. In this study of predicting

sequences, we used 95 types of codes including

differences in generator polynomials on the

condition that C, A, G and T of nucleotides

correspond to 0, 1,  and 



of Galois Field GF(4),

respectively. Every 220 loop amino acid sequence

belonging to the H5 subtype was identical to the

encoded amino acid sequence generated by the (24,

16) cyclic code with generator polynomial 

(



)





+1 (the similarity is 100%) and was not identical

to that generated by any of different 65 types of

codes (the similarity is 0%).

Table 1 shows possible amino acid changes in

the 220 loop of H5N1 influenza strains that will

emerge in the future. These are composed of 128

sequences out of the 1,024 encoded amino acid

sequences generated by the (24, 16) cyclic code with

generator polynomial 

(



)

=



+1, the rest of

which were removed because of overlap with the

encoded amino acid sequences generated by the 65

types of codes. The possible changes we predicted

are based on the assumption that although sequence

diversity in the 220 loop of H5N1 hemagglutinin

will be observed even from now on, the code

structure will probably not change.

4 CONCLUSIONS

Influenza A H1 and H3 subtypes, which have

circulated among humans for nearly 100 years since

the pandemic of 1918 and for nearly 50 years since

the pandemic of 1968 respectively, continue to

change by accumulation of mutations in the

hemagglutinin. Similarly, other subtypes such as H5

and H7, which have occasionally caused human

infections, change by mutations in the

hemagglutinin. These changes, particularly the

changes in the RBD of the hemagglutinin, lead to

the appearance of new viral strains that evade the

immune system. Therefore, it is imperative for us to

understand the mutational patterns in the RBD.

Sequence diversity in the 220 loop of the RBD, has

been observed among different hemagglutinin

subtypes, or even within the same subtype.

However, the code structure of the 220 loop from

the same subtype remains unchanged. Our results

indicate that the sequences at the 220 loop have the

structure of subtype-specific codes. The first goal of

this study was to find out the code structure of the

220 loop of influenza A viruses. We fortunately

found the rules of mutations for the loop by using

various codes in information transmission. These

findings may be very helpful in predicting sequence

changes in the 220 loop and may provide clues to

the decision of vaccine strain and the development

of new antiviral drugs. The 220 loop of the RBD is

definitely an attractive target for developing antiviral

drugs.

The second goal of this study was to predict

sequence changes in the 220 loop of H5N1 virus.

Based on the assumption that the code structure of

the 220 loop from the same subtype will probably

not change even from now on, we predicted possible

amino acid changes in the 220 loop of H5N1

influenza strains that will emerge in the future. We

cannot deny the possibility that a pandemic H5N1

strain transmissible between humans may not

possess the amino acid changes predicted here.

Monitoring the molecular changes in hemagglutinin

is important for the accurate sequence prediction.

However, our method, which determines the code

structure of the 220 loop of influenza A virus

hemagglutinin, will facilitate understanding of the

evolutionary patterns of influenza A viruses, and

further help the development of new antiviral drugs

and vaccines. Through the generation of mutant

viruses possessing hemagglutinin gene with

mutations of the 220 loop predicted in our method

and the examination of the growth and

transmissibility of the mutant viruses in animal

A Coding Theoretical Approach to Predict Sequence Changes in H5N1 Inﬂuenza A Virus Hemagglutinin

163

models, suitable vaccine candidates will be selected.

It is expected that the 220 loop-based influenza

vaccines would be effective against divergent

influenza strains, including those that may cause

pandemics in the future.

REFERENCES

Bright, R.-A. et al., 2003, Impact of glycosylation on the

immunogenicity of a DNA-based influenza H5 HA

vaccine. Virology, 308, 270-278.

Chen, M.-W. et al., 2011, Broadly neutralizing DNA

vaccine with specific mutation alters the antigenicity

and sugar-binding activities of influenza

hemagglutinin, Proceedings of the National Academy

of Sciences, 108, 3510–3515.

Das, P. et al., 2009, Free energy simulations reveal a

double mutant avian H5N1 virus hemagglutinin with

altered receptor binding specificity, J. Comput. Chem.,

30, 1654–1663.

de Vries, R.-P. et al., 2013, Evolution of the

hemagglutinin protein of the new pandemic H1N1

influenza virus: maintaining optimal receptor binding

by compensatory substitutions. J. Virol., 87, 13868-

13877.

de Vries, R.-P. et al., 2014, Hemagglutinin receptor

specificity and Structural Analyses of Respiratory

Droplet-transmissible H5N1 Viruses. J. Virol., 88,

768-773.

Durand, L.-O. et al., 2015, Timing of Influenza A(H5N1)

in Poultry and Humans and Seasonal Influenza

Activity Worldwide, 2004-2013, Emerg. Infect. Dis.,

21, 202-208.

Imai, M. et al., 2012, Experimental adaptation of an

influenza H5 HA confers respiratory droplet

transmission to a reassortant H5 HA/H1N1 virus in

ferrets, Nature, 486, 420-428.

Jiang, S. et al., 2012, Receptor-binding domains of spike

proteins of emerging or re-emerging viruses as targets

for development of antiviral vaccines, Emerging

Microbes & Infections, 1, e13.

Katoh, K. and Toh, H., 2008. Recent developments in the

MAFFT multiple sequence alignment program, Brief.

Bioinform., 9, 286–298.

Khurana, S. et al., 2011, Bacterial HA1 vaccine against

pandemic H5N1 influenza virus: evidence of

oligomerization, hemagglutination, and cross-

protective immunity in ferrets, J. Virol., 85, 1246–

1256.

McCullough, C. et al., 2012, Characterization of influenza

hemagglutinin interactions with receptor by NMR,

PloS one, 7, e33958.

Ohya, M. and Sato, K., 2000, Use of information theory to

study genome sequences. Reports on Mathematical

Physics, 46, 419–428.

Olsen, B. et al., 2006, Global patterns of influenza A virus

in wild birds. Science, 312, 384–388.

Pfeiffer, D.-U. et al., 2011, Implications of global and

regional patterns of highly pathogenic avian influenza

virus H5N1 clades for risk management, Vet. J., 190,

309-316.

Rumschlag-Booms, E. and Rong, L., 2013, Influenza a

virus entry: implications in virulence and future

therapeutics, Advances in virology, 2013, 121924.

Sato, K. et al., 2013, The code structure of the p53 DNA-

binding domain and the prognosis of breast cancer

patients. Bioinformatics, 29, 2822–2825.

Schrauwen, E.-J., 2014, Host adaptation and transmission

of influenza A viruses in mammals. Emerg. Microbes

Infect., 3, e9.

Stevens, J. et al., 2006, Structure and receptor specificity

of the hemagglutinin from an H5N1 influenza.

Science, 312, 404-410.

Xu, Q. et al., 2010, Influenza H1N1 A/Solomon

Island/3/06 virus receptor binding specificity

correlates with virus pathogenicity, antigenicity, and

immunogenicity in ferrets. J. Virol., 84, 4936-4945.

Yamamoto, N. et al., 2011, Characterization of a non-

pathogenic H5N1 influenza virus isolated from a

migratory duck flying from Siberia in Hokkaido,

Japan, in October 2009. Virol. J., 8, 65.

Yen, H.-L. and Peiris, J.-S., 2012, Virology: bird flu in

mammals, Nature, 486, 332-333.

BIOINFORMATICS 2016 - 7th International Conference on Bioinformatics Models, Methods and Algorithms

164

APPENDIX

Table S1: Amino acid sequence diversity in the 220 loop of human H1 hemagglutinin.

H1 strain Residues 221-228

AF117241|A/South_Carolina/1/18|H1N1 PKVRDQAG

CY010788|A/WSN/1933_TS61|H1N1 PKVKDQHG

U08904|A/WS/1933|H1N1 PKVRDQPG

DQ508905|A/Wilson_Smith/1933|H1N1 PKVRDQHG

U08903|A/NWS/1933|H1N1 PKVRNQPG

CY040170|A/Puerto_Rico/8_SV14/1934|H1N1 PKVKGQAG

CY146857|A/Puerto_Rico/8_SV40/1934|H1N1 PKVKDQAG

CY147326|A/BH/JY2/1935|H1N1 PKVRDQTG

CY020445|A/Henry/1936|H1N1 PEVRDQAG

CY013271|A/Hickox/1940|H1N1 PKVRGQAG

CY045772|A/Melbourne/1/1946|H1N1 PEVKDQAG

CY077768|A/Netherlands/002P1/1951|H1N1 PKVRNQAG

CY009340|A/Malaysia/54|H1N1 PKVRGQPG

CY008988|A/Denver/1/1957|H1N1 PKVRDQSG

CY125862|A/Kw/1/1957|H1N1 PKVRGQSG

CY021717|A/California/10/1978|H1N1 PKVRGQEG

CY028724|A/California/45/1978|H1N1 PKVRDQEG

CY020173|A/Lackland/7/1978|H1N1 PKVRDQKG

CY017203|A/Memphis/23/1983|H1N1 PKVRNQEG

CY104862|A/TayNguyen/TN182/2006|H1N1 PKVRDQGG

EU100724|A/Solomon_Islands/03/2006|H1N1 PKVRDREG

EU199338|A/Texas/06/2007|H1N1 PKVRBQEG

CY027779|A/Kentucky/UR06_0339/2007|H1N1 PKVREQEG

CY118091|A/Malaysia/1794173/2007|H1N1 LKVRDQEG

EU516017|A/Hawaii/31/2007|H1N1 PKIRDQEG

CY073960|A/Mexico/UASLP_009/2008|H1N1 PKLRDQDG

GU367325|A/Novgorod/01/2009|H1N1 PKVREREG

CY049076|A/Singapore/ON141/2009|H1N1 PKVGDQEG

CY051455|A/Wisconsin/629_S0339/2009|H1N1 TKVRDQEG

CY095906|A/Zhejiang/8/2009|H1N1 PKVRDQER

CY054606|A/Thailand/THB0405/2009|H1N1 PRVRDQEG

CY122835|A/Singapore/GP2242/2009|H1N1 PQVRDQEG

CY075897|A/Blore/NIV1196/2009|H1N1 PKMRGKEG

KC781609|A/California/33/2009|H1N1 PKMRDQEG

CY083399|A/Great_Lakes/WRAIR1664P/2009|H1N1 PKVKEQEG

KC782207|A/South_Carolina/18/2009|H1N1 PKVKDQEG

KC781375|A/Oregon/35/2009|H1N1 HKVRDQEG

CY095955|A/Zhejiang/86/2009|H1N1 PKVRDQEA

CY057254|A/New_York/5186/2009|H1N1 PKVMDQEG

CY069114|A/Madrid/INS296/2009|H1N1 PKVRAQEG

HM581919|A/Iran/15583/2009|H1N1 PKVRDRQG

KF411180|A/Qingdao/FF85/2009|H1N1 PKVRDSEG

CY067632|A/Qingdao/66/2010|H1N1 PKVRDQEW

CY092952|A/Chile/15/2010|H1N1 PKLRDQEG

CY079544|A/Switzerland/5165/2010|H1N1 PKVREQAG

JQ796827|A/Zhejiang/HZ19/2011|H1N1 PIVRDQEG

JQ396238|A/Kenya/145/2011|H1N1 PKGRDQEG

KF451900|A/Kenya/262/2013|H1N1 PKVKEQDG

KM013710|A/Shiraz/87/2013|H1N1 PKVRDHEG

KJ645782|A/Gainesville/03/2014|H1N1 PKVRSQEG

All groups of identical sequences in the 220 loop sequences from H1 subtype that infected humans between 1918 and 2014 were

represented by the oldest sequence in the group.

A Coding Theoretical Approach to Predict Sequence Changes in H5N1 Inﬂuenza A Virus Hemagglutinin

165

Table S2: Amino acid sequence diversity in the 220 loop of human H3 hemagglutinin.

H3 strain Residues 221-228

CY011120|A/Northern_Territory/60/1968|H3N2 PWVRGLSS

V01103|A/NT/60/68/29c|H3N2 PWVRGQSS

AB284320|A/Aichi/2/1968|H3N2 PWVGGLSS

CY033529|A/Hong_Kong/1_9_MA21_3/1968|H3N2 PWIRGLSS

CY112249|A/Hong_Kong/1/1968|H3N2 PWVRGMSS

CY112297|A/Bilthoven/6022/1972|H3N2 PWVRGPSS

CY113957|A/Akita/4/1993|H3N2 PWVRGQPS

CY113981|A/Lyon/672/1993|H3N2 PWVRGLPS

CY114149|A/Hong_Kong/56/1994|H3N2 PWVRGISS

CY118426|A/Malaysia/07831/1995|H3N2 PWVRGVSS

CY009676|A/New_York/576/1997|H3N2 PWIRGVSS

CY121424|A/California/32/1999|H3N2 HWVRGVSS

CY001397|A/New_York/156/2000|H3N2 PWERGVSS

EU856922|A/Hong_Kong/CUHK22072/2000|H3N2 PWVRDVSS

EU856918|A/Hong_Kong/CUHK21932/2001|H3N2 PWIRDVSS

EU856946|A/Hong_Kong/CUHK24749/2001|H3N2 PRVRDVSS

DQ415319|A/TW/872/02|H3N2 HRVRDVSS

CY112933|A/Fujian/411/2002|H3N2 PRVRGVSS

CY003096|A/New_York/403/2002|H3N2 PWGRGVSS

CY007843|A/Canterbury/14/2002|H3N2 PWARGVSS

EU857019|A/Hong_Kong/CUHK50200/2002|H3N2 PRIRDVSS

EU103747|A/Denmark/87/2003|H3N2 PRVRDVPS

EF568926|A/Thailand/Siriraj_02/2003|H3N2 PRVRDIPS

AY531033|A/Wyoming/3/03|H3N2 PRVRDISS

EU857094|A/Hong_Kong/CUHK83422/2003|H3N2 LRVRDVPS

DQ249261|A/Taiwan/30005/2004|H3N2 PRVRHIPS

CY105310|A/HaNoi/HN30147/2004|H3N2 TRVRDVPS

CY013517|A/Wellington/58/2004|H3N2 SRVRDIPS

CY002064|A/New_York/392/2004|H3N2 PRIRDVPS

CY163648|A/Wisconsin/67/2005|H3N2 PRIRNIPS

EU283414|A/Hiroshima/52/2005|H3N2 PRVRNIPS

CY016595|A/South_Australia/18/2005|H3N2 LRVRDIPS

CY016028|A/Western_Australia/74/2005|H3N2 PRIRDIPS

KJ855363|A/Mexico/DIF29/2006|H3N2 LRVRNIPS

CY020357|A/New_York/923/2006|H3N2 PRVRBIPS

EU716471|A/Texas/03/2008|H3N2 HRVRNIPS

CY037543|A/California/UR07_0053/2008|H3N2 PRIKNIPS

FJ179354|A/Minnesota/14/2008|H3N2 PKVRNIPS

GQ385889|A/New_Hampshire/01/2009|H3N2 PRVREIPS

CY050125|A/Qingdao/1329/2009|H3N2 PRVGNIPS

CY091837|A/Guangdong/322/2010|H3N2 TRVRNIPS

JX946754|A/Qingdao/FF184/2010|H3N2 PRLRNIPS

KC882891|A/District_Of_Columbia/02/2010|H3N2 ARVRNIPS

KC882953|A/Minnesota/04/2011|H3N2 SRVRNIPS

CY162984|A/Peru/PER345/2011|H3N2 PRVRNVPS

KC892741|A/New_Jersey/08/2011|H3N2 LRIRNIPS

KC892638|A/California/34/2011|H3N2 PRIRBIPS

KJ942608|A/Hawaii/22/2012|H3N2 PRIRNSPS

KF598718|A/British_Columbia/004/2012|H3N2 HRIRNIPS

KC892959|A/Hawaii/02/2012|H3N2 TRIRNIPS

CY134996|A/Texas/JMM_37/2012|H3N2 PRIRNVPS

KF789696|A/Maine/05/2012|H3N2 PRIRNNPS

KF790228|A/Hawaii/30/2012|H3N2 SRIRNIPS

CY141264|A/Texas/3249/2013|H3N2 PRIRSIPS

KF789872|A/Hawaii/02/2013|H3N2 LRIRDIPS

KM064043|A/Texas/14/2014|H3N2 HRIRDIPS

All groups of identical sequences in the 220 loop sequences from H3 subtype that infected humans between 1968 and 2014 were

represented by the oldest sequence in the group.

BIOINFORMATICS 2016 - 7th International Conference on Bioinformatics Models, Methods and Algorithms

166

Table S3: Amino acid sequence diversity in the 220 loop of human H5 hemagglutinin.

H5 strain Residues 221-228

GU052142|A/Hong_Kong/485/1997|H5N1 PKVNGQSG

GU052089|A/Hong_Kong/378.1/2001|H5N1 SKVNGQSG

AB212054|A/Hong_Kong/213/2003|H5N1 SKVNGQNG

EF107522|A/Thailand/1_KAN_1A_/2004|H5N1 SEVNGQSG

EF456802|A/Viet_Nam/JPHN30321/2005|H5N1 SKINGQSG

DQ371929|A/Anhui/2/2005|H5N1 SKVNGRSG

KF918470|A/Cambodia/X0810301/2013|H5N1 SKVKGLSG

All groups of identical sequences in the 220 loop sequences from H5 subtype that infected humans between 1997 and 2013 were

represented by the oldest sequence in the group.

Table S4: Amino acid sequence diversity in the 220 loop of human H7 hemagglutinin.

H7 strain Residues 221-228

GU053110|A/England/AV877/1996|H7N7 PQVNGQSG

CY181569|A/Anhui/DEWH72_08/2013|H7N9 PQVNGLSG

KF018039|A/Taiwan/1/2013|H7N9 PQVNGPSG

KC853766|A/Hangzhou/1/2013|H7N9 PQVNGISG

KF609511|A/Shanghai/JS01/2013|H7N9 TQVNGQSG

All groups of identical sequences in the 220 loop sequences from H7 subtype that infected humans between 1996 and 2014 were

represented by the oldest sequence in the group.

A Coding Theoretical Approach to Predict Sequence Changes in H5N1 Inﬂuenza A Virus Hemagglutinin

167