(residues 221-228) based on H3 numbering (Das et
al., 2009; Durand et al., 2015; Jiang et al., 2012;
Stevens et al., 2006). Mutations in the RBD are
thought to affect the receptor-binding avidity and
specificity of hemagglutinin (Chen et al., 2011; de
Vries et al., 2013; de Vries et al., 2014; Schrauwen
and Fouchier, 2014). The RBD
is the primary target of neutralizing antibodies,
which are induced by virus infection or by
vaccination with specific antigen (Bright et al.,
2003; Chen et al., 2011; Jiang et al., 2012; Khurana
et al., 2011; McCullough et al., 2012). However,
mutations in the RBD change viral immunogenicity
and antigenicity (Chen et al., 2011; Xu et al., 2010).
Jiang et al. (2012) state that the RBD plays a
critical role in the elicitation of the antiviral
immune response and protective immunity.
McCullough et al. (2012) also state that a better
understanding of mutations in the RBD may be
useful in vaccine and drug design efforts. To prepare
for the future emergence of potentially dangerous
outbreaks caused by divergent influenza strains,
including human-adapted H5N1 strains, it is
imperative that we understand the rules stored in
the sequence of the RBD.
The information of life is stored as a code composed
of four nucleotides: adenine (A), cytosine (C),
guanine (G), and thymine (T). We can therefore
regard the DNA or gene of each organism as a code
that reflects its inherent structure. In a
protein-coding region, each group of three
consecutive nucleotides is called a codon, and each
codon corresponds to one amino acid. The number of
possible nucleotide triplets is the third power of 4,
which means there are 64 codons. However, only 20
standard proteinogenic amino acids exist. Moreover,
the third nucleotide of a codon is thought not to
play an essential role in specifying the amino acid.
This shows that a gene has redundancy that allows
errors to be corrected to some extent. In other
words, it has a structure similar to that of an
error-correcting/detecting code for the transmission
of information. In life-science research, it is
important to determine the code structure of the
target gene. Once we know the code structure, we can
apply mathematical results from coding theory to
research in life science. How can the RBD sequences
of influenza A viruses be discussed using coding
theory? The present study was conducted to find the
code structure of the 220 loop of influenza A
viruses and to predict sequence changes in the 220
loop of the H5N1 virus.
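The codon redundancy described above can be checked with a short sketch.
It assumes the standard genetic code (the table below is not taken from
this paper) and uses a hypothetical helper name,
third_position_free_prefixes, to list the two-nucleotide prefixes for
which the third nucleotide does not change the encoded residue.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code (an assumption of this sketch), with codons in the
# order TTT, TTC, TTA, TTG, TCT, ...; '*' marks a stop codon.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)

# Map each of the 4**3 = 64 codons to its amino acid (or stop).
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def third_position_free_prefixes(table=CODON_TABLE):
    """Return the two-nucleotide prefixes whose four codons all encode
    the same residue, i.e. the third nucleotide carries no information."""
    prefixes = []
    for first, second in product(BASES, repeat=2):
        residues = {table[first + second + third] for third in BASES}
        if len(residues) == 1:
            prefixes.append(first + second)
    return prefixes


if __name__ == "__main__":
    print(len(CODON_TABLE), "codons")
    print(len(set(AMINO_ACIDS) - {"*"}), "amino acids")
    print("third position irrelevant for:", third_position_free_prefixes())
```

Under the standard code, only some prefixes fix the residue completely,
so the third nucleotide is redundant in part rather than in every codon.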
2 METHODS
2.1 Sequence Data
We applied artificial codes from coding theory to
sequence analysis of the 220 loop in the H1, H3, H5,
and H7 RBDs. All full-length amino acid and
nucleotide sequences of hemagglutinin from
influenza A H1, H3, H5, and H7 subtypes were
downloaded from the Influenza Research Database
in September 2014. The hemagglutinin data set
consists of 8,941 human sequences from the H1
subtype between 1918 and 2014, 6,013 human
sequences from the H3 subtype between 1968 and
2014, 230 human sequences from the H5 subtype
between 1997 and 2013, and 51 human sequences
from the H7 subtype between 1996 and 2014. The
sequences were aligned using MAFFT (Katoh and
Toh, 2008), which can quickly process a large dataset.
2.2 Sequence Analysis of the 220 Loop
by Coding Theory
We explain how to encode the nucleotide sequence
of the 220 loop to detect the code structure. The
method for applying artificial codes to sequence
analysis has been described in detail previously
(Ohya and Sato, 2000; Sato et al., 2013). Since the
Galois field GF(4) consists of four elements, 0, 1,
$\alpha$, and $\alpha^2$, such that
$\alpha^2 + \alpha + 1 = 0$, the four nucleotides
can each be expressed as one of the four elements.
There are 24 (= 4!) possible one-to-one mappings
from the four nucleotides to the four elements of
GF(4).
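As an illustration of the representation above, the following minimal
sketch implements GF(4) arithmetic and enumerates the 24 mappings. The
integer encoding (2 for alpha, 3 for alpha^2) and the function names
gf4_add, gf4_mul, nucleotide_mappings, and to_gf4 are assumptions made
for this example and are not taken from the cited method.

```python
from itertools import permutations

# GF(4) elements encoded as integers (an assumption of this sketch):
# 0 -> 0, 1 -> 1, 2 -> alpha, 3 -> alpha^2 (= alpha + 1).
ALPHA = 2

# Powers of alpha: the nonzero elements form a cyclic group of order 3.
_EXP = [1, 2, 3]            # alpha^0, alpha^1, alpha^2
_LOG = {1: 0, 2: 1, 3: 2}   # inverse lookup of _EXP


def gf4_add(x, y):
    """Addition in GF(4): characteristic 2, so it is a bitwise XOR."""
    return x ^ y


def gf4_mul(x, y):
    """Multiplication in GF(4) by adding exponents of alpha modulo 3."""
    if x == 0 or y == 0:
        return 0
    return _EXP[(_LOG[x] + _LOG[y]) % 3]


def nucleotide_mappings(nucleotides="ACGT"):
    """Return all 4! = 24 one-to-one maps from nucleotides to GF(4)."""
    return [dict(zip(nucleotides, perm)) for perm in permutations(range(4))]


def to_gf4(sequence, mapping):
    """Convert a nucleotide string into a list of GF(4) elements."""
    return [mapping[n] for n in sequence]


if __name__ == "__main__":
    # Check the defining relation alpha^2 + alpha + 1 = 0.
    assert gf4_add(gf4_add(gf4_mul(ALPHA, ALPHA), ALPHA), 1) == 0
    mappings = nucleotide_mappings()
    print(len(mappings), "mappings")
    print(to_gf4("GATTACA", mappings[0]))
```

Any of the 24 mappings can be passed to to_gf4; which of them best
reveals a code structure is a question for the analysis itself and is
not decided by this sketch.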
First, the essential part of the nucleotide
sequence of the 220 loop from an influenza strain,
namely the nucleotide sequence excluding the third
nucleotide of each codon, is transformed into an
information sequence consisting of elements of
GF(4). Next, the information sequence is grouped
into blocks, and each block is encoded into a code
word of an error-correcting/detecting code C. The
total length of each code word (code word length) is
a multiple of 3, and the number of information
symbols (information block length) is a multiple of
2. The check symbols in each code word are placed at
the positions corresponding to the third nucleotide
of each codon. Then, the encoded sequence, which
consists of the set of code words, is converted back
into a nucleotide sequence, which we call the
encoded nucleotide sequence. After that, the encoded
nucleotide sequence is translated into an amino acid
sequence, which we call the encoded amino acid
sequence. Finally, the degree of
similarity between the amino acid sequence of the