SystemsBio Lab

IPMiner is a web server for identifying potential progenitor genes of influenza A viruses. It is built on genetic distances derived from the Complete Composition Vector (CCV) and on the Minimal Spanning Tree (MST) algorithm. The standalone CCV is freely available HERE.

Influenza A Virus

Influenza A virus is a negative-stranded RNA virus that belongs to the Orthomyxoviridae family. Its genome has 8 segments (segments 1 through 8), ranging in length from about 890 to 2,341 nucleotides. The subtypes of influenza A viruses are named by combining the serotypes of their surface proteins hemagglutinin (HA) and neuraminidase (NA). To date, 16 HA (H1 through H16) and 9 NA (N1 through N9) serotypes have been identified (Fouchier et al., 2005). For example, the H5N1 influenza A virus has HA serotype H5 and NA serotype N1. Influenza A virus causes zoonotic diseases in many different hosts, for instance humans, pigs, birds, horses, seals, whales, and dogs. As a segmented, negative-stranded RNA virus, influenza A virus is notorious for its rapid mutation and frequent reassortment. A reassortment event is the exchange of gene segments between co-infecting influenza viruses, and reassortment facilitated the emergence of the 1957 H2N2, 1968 H3N2, and 2009 H1N1 pandemic strains. Identifying the genetic origins of influenza A viruses will enhance our understanding of their evolution and adaptation mechanisms and thus help influenza prevention and control.
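The subtype naming convention described above can be illustrated with a small sketch. The function name `parse_subtype` and the constants are hypothetical and are not part of IPMiner; the serotype ranges are simply the counts quoted in the text.

```python
import re

# Serotype ranges as quoted in the text (16 HA and 9 NA serotypes).
MAX_HA = 16
MAX_NA = 9


def parse_subtype(name):
    """Split a subtype label such as 'H5N1' into (HA serotype, NA serotype)."""
    match = re.fullmatch(r"H(\d+)N(\d+)", name.strip().upper())
    if match is None:
        raise ValueError(f"not a subtype label: {name!r}")
    ha, na = int(match.group(1)), int(match.group(2))
    if not (1 <= ha <= MAX_HA and 1 <= na <= MAX_NA):
        raise ValueError(f"serotype out of range: {name!r}")
    return ha, na


print(parse_subtype("H5N1"))  # (5, 1)
print(MAX_HA * MAX_NA)        # 144 possible HA/NA combinations
```

With these ranges, 16 x 9 = 144 HA/NA combinations are possible in principle.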

Fouchier RA, Munster V, Wallensten A, Bestebroer TM, Herfst S, et al. (2005) Characterization of a novel influenza A virus hemagglutinin subtype (H16) obtained from black-headed gulls. Journal of Virology, 79, 2814-2822.

Influenza Progenitor Identification

Phylogenetic analysis is the traditional approach to identifying progenitors of influenza A viruses. First, the nucleotide sequences are aligned using a multiple sequence alignment method such as Clustal W (Thompson, 1994), MUSCLE (Edgar, 2004), or T-COFFEE (Notredame, 1998). The strength of multiple sequence alignment lies in its ability to recognize the structural homogeneity among sequences. Its limitation is a high computational cost, which increases exponentially with the number of input sequences. Second, phylogenetic analysis is performed on the aligned sequences to infer their evolutionary relationships using Neighbor-Joining, Maximum Parsimony, Maximum Likelihood, or Bayesian inference (see the list of phylogenetic programs on our resources page). Bootstrap analysis or computation of posterior probabilities is usually applied to estimate the phylogenetic uncertainty. Like multiple sequence alignment, phylogenetic inference is very time-consuming, especially the bootstrap and posterior probability estimation. Thus, it is difficult to apply this traditional method to a large dataset, for instance one with more than 1,000 taxa, as is common in influenza studies.
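To make the exponential cost concrete, the sketch below counts the cells of the dynamic-programming lattice that an exact simultaneous alignment of N sequences would have to fill. This is an illustration under simplifying assumptions (all sequences have the same length, and the hypothetical function `exact_msa_cells` counts cells only). Practical tools such as Clustal W use progressive heuristics and do not fill this lattice.

```python
def exact_msa_cells(num_sequences, length):
    """Cells in the dynamic-programming lattice for an exact alignment of
    num_sequences sequences, each of the given length."""
    return (length + 1) ** num_sequences


# The lattice grows by a factor of (length + 1) with every added sequence.
for n in (2, 3, 4, 5):
    print(n, exact_msa_cells(n, 100))
```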

Alternatively, BLAST (Altschul, 1990) can be applied to identify homologous genes in a database. BLAST determines similarity by identifying initial short matches and using them to start local alignments. Because influenza viral sequences are highly similar, especially in the most conserved regions, BLAST usually returns a large number of hits, which is not helpful for progenitor identification. In addition, because BLAST is a local alignment method, its results do not reflect the global evolutionary relationships between the sequences.
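The seed-then-extend idea can be sketched in a few lines. This is a toy illustration only, not the BLAST implementation: it uses exact word matches and extends them without gaps or mismatches, whereas BLAST scores its extensions and applies statistical cutoffs. The function names, the word size of 4, and the two sequences are all made up for the example.

```python
def find_seeds(query, subject, word_size=4):
    """Return (query_pos, subject_pos) pairs where a word matches exactly."""
    index = {}
    for j in range(len(subject) - word_size + 1):
        index.setdefault(subject[j:j + word_size], []).append(j)
    seeds = []
    for i in range(len(query) - word_size + 1):
        for j in index.get(query[i:i + word_size], []):
            seeds.append((i, j))
    return seeds


def extend_seed(query, subject, i, j, word_size=4):
    """Extend a seed in both directions while the characters keep matching.

    Returns (query_start, subject_start, match_length)."""
    start_i, start_j = i, j
    while start_i > 0 and start_j > 0 and query[start_i - 1] == subject[start_j - 1]:
        start_i -= 1
        start_j -= 1
    end_i, end_j = i + word_size, j + word_size
    while end_i < len(query) and end_j < len(subject) and query[end_i] == subject[end_j]:
        end_i += 1
        end_j += 1
    return start_i, start_j, end_i - start_i


query = "TTGACGTACC"
subject = "AAGACGTAGG"
print(find_seeds(query, subject))         # [(2, 2), (3, 3), (4, 4)]
print(extend_seed(query, subject, 3, 3))  # (2, 2, 6)
```

Even in this toy case, one shared region produces three overlapping seeds. Among highly similar influenza sequences the same effect yields many near-identical hits.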

Altschul, S.F., Gish, W., Miller, W., Myers, E.W. and Lipman, D.J. (1990) Basic local alignment search tool, Journal of Molecular Biology, 215, 403-410.
Edgar, R.C. (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput, Nucleic Acids Research, 32, 1792-1797.
Notredame, C., Holm, L. and Higgins, D.G. (1998) COFFEE: an objective function for multiple sequence alignments, Bioinformatics (Oxford, England), 14, 407-422.
Thompson, J.D., Higgins, D.G. and Gibson, T.J. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice, Nucleic Acids Research, 22, 4673-4680.

CCV and MST

CCV was originally developed for constructing phylogenetic trees of prokaryotic genomes (Wu et al. 2006) and has been adapted for influenza evolutionary analyses (Wan et al. 2007a) and genotypic analyses (Wan et al. 2007b). More recently, we developed a new reassortment identification algorithm by integrating CCV with MST clustering (Wan et al. 2008). Our results showed that CCV-MST is very efficient in identifying potential progenitor genes in influenza viruses. IPMiner was developed on the basis of this algorithm specifically for identifying influenza progenitor genes.
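The general shape of an alignment-free distance combined with a minimum spanning tree can be sketched as follows. This is not the published CCV-MST algorithm: a plain 3-mer frequency vector with cosine distance stands in for the CCV distance, Prim's algorithm is one common way to build an MST, and the sequences and names (`query`, `strainA`, `strainB`) are invented. Reading a query's tree neighbours as candidate progenitors is likewise an assumption made here for illustration.

```python
from collections import Counter
from itertools import product
import math


def kmer_vector(seq, k=3):
    """Relative k-mer frequencies of a nucleotide sequence, in a fixed order."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values()) or 1
    return [counts["".join(p)] / total for p in product("ACGT", repeat=k)]


def cosine_distance(u, v):
    """1 - cosine similarity; stand-in for the CCV distance (assumption)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm else 1.0


def minimum_spanning_tree(names, dist):
    """Prim's algorithm on a full distance matrix; returns (a, b, distance) edges."""
    in_tree = {0}
    edges = []
    while len(in_tree) < len(names):
        d, i, j = min(
            (dist[i][j], i, j)
            for i in in_tree
            for j in range(len(names))
            if j not in in_tree
        )
        in_tree.add(j)
        edges.append((names[i], names[j], d))
    return edges


# Made-up toy sequences; real input would be influenza gene segments.
sequences = {
    "query": "ATGGCGTACGTTAGCATGGCGTACGTTAGC",
    "strainA": "ATGGCGTACGTTAGCATGGCGTACGATAGC",
    "strainB": "TTTTCCCCGGGGAAAATTTTCCCCGGGGAA",
}
names = list(sequences)
vectors = [kmer_vector(sequences[n]) for n in names]
dist = [[cosine_distance(u, v) for v in vectors] for u in vectors]
tree = minimum_spanning_tree(names, dist)
for a, b, d in tree:
    print(f"{a} -- {b}: {d:.3f}")
```

No alignment is computed at any point: each sequence is reduced to a vector once, and only pairwise vector distances are needed to build the tree, which is what makes this family of methods attractive for large datasets.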

Wan, X.F., Wu, X., Lin, G., Holton, S.B., Desmone, R.A., Shyu, C.R., Guan, Y. and Emch, M.E. (2007a) Computational identification of reassortments in avian influenza viruses, Avian Diseases, 51, 434-439.
Wan, X.F., Chen, G., Luo, F., Emch, M. and Donis, R. (2007b) A quantitative genotype algorithm reflecting H5N1 Avian influenza niches, Bioinformatics (Oxford, England), 23, 2368-2375.
Wan, X.F., Ozden, M. and Lin, G. (2008) Ubiquitous reassortments in influenza A viruses, Journal of Bioinformatics and Computational Biology, 6, 981-999.
Wu, X., Wan, X.F., Wu, G., Xu, D. and Lin, G. (2006) Phylogenetic analysis using complete signature information of whole genomes and clustered neighbor-joining method, International Journal of Bioinformatics Research and Applications, 2, 219-248.

Acknowledgement

The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIAID or the NIH.