Note: The numbered references cited in this statement are listed in my curriculum vitae.

As Steve Jobs said, "I think the biggest innovations of the 21st century will be at the intersection of biology and technology. A new era is beginning." Identification of human disease genes is one of the major tasks of modern genomics. Current research has revealed thousands of disease risk variants and many signaling pathways, greatly improving our understanding of the role of genetics in human disease. However, a large portion of heritability remains unexplained. Powerful statistical methods are needed to close the gap between current discoveries and the pinpointing of causal genes, the dissection of the underlying disease etiology, and eventually translational research, clinical practice, and personalized medicine.

My research focuses on statistical methods and their applications to genetic epidemiology and bioinformatics. My goal is to use my statistical skills and my comprehensive background in genetics to discover novel genetic risk factors for human diseases. I am dedicated to narrowing the gap between current genetic discoveries and future research. My research topics cover multiple challenging issues, including (1) rare variants, (2) multiple phenotypes, (3) integrative analysis via genetic interaction networks, (4) joint analysis of familial and unrelated data to enlarge sample size, (5) identification of genomic signals as biomarkers, and (6) other susceptibility sources such as proteomics.

• Statistical genetics

1. Semi-parametric models for multiple phenotypes analysis in genetic association studies [6]
Studying multiple phenotypes measured on the same set of subjects makes fuller use of the data for association mapping between
genotypes and phenotypes, and it is also economical. However, a case-control sample is not a random sample
from the general population. If an analysis ignores the case-control sampling, the estimated association between a genetic variant and
a secondary phenotype can be biased. We propose a general approach for estimating and testing the population effect of a
genetic variant on a secondary phenotype that corrects this bias. Our approach is based on inverse-weighted estimating
equations, with the conditional probability of an individual being a case or a control as the corresponding weight. Our model is
substantially more robust to model misspecification than a likelihood-based analysis and outperforms it in terms of both validity
and power. It is also much more computationally efficient. These advantages make our approach a practical tool for
genome-wide genetic association studies.

Next, we extend our previous studies to the analysis of multiple phenotypes using sequencing data. We continue to use inverse-weighted estimating equations, with the conditional probability of an individual's primary-trait status as the corresponding weight. When the primary trait is continuous, however, we need to introduce a latent binary variable and connect it to the continuous primary trait. Unbiasedness holds provided that model consistency and identifiability are carefully examined. Such a design matters in practice: for example, in the CHARGE study, hundreds of phenotypes other than those selected for sequencing cannot be analyzed without bias if we ignore the sampling on the selected traits, and re-sequencing for hundreds of phenotypes is not financially feasible. In addition, our design considers missing data, family structure, rare variants, meta-analysis, and pleiotropic and polygenic analysis. We will also consider the influence of different study designs, especially the case-cohort design used for CHARGE sequencing. We test our models using CHARGE sequencing data. User-friendly software will be released to the public.
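The bias from ignoring case-control sampling, and its correction by inverse weighting, can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the estimator from [6]: the population model, the effect sizes, and the function names `simulate_case_control` and `weighted_slope` are hypothetical, and the weights are the known inverse sampling fractions of cases and controls in a simulated population, which would have to be estimated in a real study. The genotype has no true effect on the secondary phenotype, so an unbiased slope estimate should be close to zero.

```python
import numpy as np


def simulate_case_control(n_pop=400_000, n_cases=3000, n_controls=3000, seed=0):
    """Draw a case-control sample from a simulated population (illustrative model).

    Genotype g has NO effect on the secondary phenotype y, but both g and y
    raise the risk of the primary disease d, which drives the sampling.
    """
    rng = np.random.default_rng(seed)
    g = rng.binomial(2, 0.3, size=n_pop).astype(float)   # additive genotype coding
    y = rng.normal(size=n_pop)                           # secondary phenotype
    risk = 1.0 / (1.0 + np.exp(-(-5.0 + 1.0 * g + 1.0 * y)))
    d = rng.random(n_pop) < risk                         # primary disease status
    case_idx, control_idx = np.flatnonzero(d), np.flatnonzero(~d)
    if case_idx.size < n_cases or control_idx.size < n_controls:
        raise ValueError("population too small for the requested sample")
    cases = rng.choice(case_idx, n_cases, replace=False)
    controls = rng.choice(control_idx, n_controls, replace=False)
    idx = np.concatenate([cases, controls])
    # Inverse sampling fraction: how many population members each subject represents.
    w = np.where(d[idx], case_idx.size / n_cases, control_idx.size / n_controls)
    return g[idx], y[idx], w


def weighted_slope(g, y, w):
    """Solve the weighted estimating equation sum_i w_i x_i (y_i - x_i'b) = 0 for the slope."""
    x = np.column_stack([np.ones_like(g), g])
    xtwx = x.T @ (w[:, None] * x)
    xtwy = x.T @ (w * y)
    return float(np.linalg.solve(xtwx, xtwy)[1])


if __name__ == "__main__":
    g, y, w = simulate_case_control()
    print("naive slope (sampling ignored):", weighted_slope(g, y, np.ones_like(w)))
    print("inverse-weighted slope:        ", weighted_slope(g, y, w))
```

With unit weights the pooled case-control sample shows a spurious positive slope, because cases are enriched for both the risk genotype and high phenotype values; the weighted version restores the population-level estimate at the cost of a larger variance.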

2. Nonparametric Bayes modeling of genetic interaction networks for integratively identifying sets of genetic risks of human diseases [1, 3, 5, 22, 24]

A better understanding of complex human diseases relies on better dissecting the underlying genetic interaction
networks, taking various types of interaction into account. Modeling interaction networks of high-dimensional
data is challenging, particularly when one wishes to avoid strong assumptions about the dependence structure. We
develop a nonparametric Bayes approach that defines a prior with full support on the space of distributions. This support
condition ensures that we are not restricting the dependence structure a priori. We show this can be accomplished through a
Dirichlet process mixture of product multinomial distributions, which is also a convenient form for posterior computation.
We have extended this work into a series of explorations for integratively identifying sets of disease risk factors, including genotypes,
environmental factors, disease traits, and all other relevant risk factors.
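To make the model form concrete, the following sketch draws categorical data from a truncated stick-breaking approximation to a Dirichlet process mixture of product multinomials and evaluates the implied joint probability of a cell. It illustrates the prior structure only and is not the posterior computation of [1]; the truncation level, the symmetric Dirichlet(1) priors on the category probabilities, and the function names are assumptions made for illustration.

```python
import numpy as np


def stick_breaking(alpha, n_components, rng):
    """Truncated stick-breaking weights; the last stick takes all remaining mass."""
    v = rng.beta(1.0, alpha, size=n_components)
    v[-1] = 1.0
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return v * remaining


def sample_dpm_product_multinomial(n, levels, alpha=1.0, n_components=20, seed=0):
    """Sample n rows of categorical data; levels[j] is the number of categories of variable j."""
    rng = np.random.default_rng(seed)
    pi = stick_breaking(alpha, n_components, rng)
    # psi[j][h] holds the category probabilities of variable j within component h.
    psi = [rng.dirichlet(np.ones(d), size=n_components) for d in levels]
    z = rng.choice(n_components, size=n, p=pi)           # latent component labels
    x = np.empty((n, len(levels)), dtype=int)
    for j, d in enumerate(levels):
        cdf = np.cumsum(psi[j][z], axis=1)               # one CDF per row, given its component
        u = rng.random(n)
        x[:, j] = np.minimum((u[:, None] > cdf).sum(axis=1), d - 1)
    return x, z, pi, psi


def joint_pmf(cell, pi, psi):
    """P(x = cell) = sum_h pi_h * prod_j psi[j][h, cell_j]."""
    prob = pi.copy()
    for j, c in enumerate(cell):
        prob = prob * psi[j][:, c]
    return float(prob.sum())


if __name__ == "__main__":
    x, z, pi, psi = sample_dpm_product_multinomial(1000, levels=(3, 3, 2))
    print("first rows:\n", x[:5])
    print("P(x = (0, 0, 0)) =", joint_pmf((0, 0, 0), pi, psi))
```

Within a component the variables are independent, but mixing over components induces dependence in the marginal joint distribution, which is what allows the dependence structure to be learned rather than assumed.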

Next, we aim to integratively identify sets of disease risk factors, including genotypes, environmental factors, fat-intake variables, disease traits, and all other relevant factors. We extend our model in [1] to handle mixed discrete and continuous data, family structure, sequencing data, prior incorporation, and efficient posterior calculation. Compared to previous models, our model produces genetic interaction networks without restrictive assumptions while avoiding false positives. We incorporate discrete and continuous data through a Dirichlet process mixture of product multinomial distributions that models the joint distribution of all variables. From the joint distribution, the genetic interaction network can be learned. Via a latent variable labeling process, our model can optimally capture the genetic association between the trait and the variables of interest. We incorporate family structure via a genetic random-effect variable. We consider sequencing data using a SKAT-based method and/or the approach. We apply our models to explore a network of genes that might interact with dietary fat intake variables to influence the trabecular bone density of the L3 vertebra, measured using quantitative computed tomography. I have included this work in an NIH R01 grant proposal.

3. Unified analyses of case-control & familial data using a novel Bayesian framework [4]
Several central issues continue to hamper efforts to improve the power of mapping trait loci in complex human diseases.
Inadequate power can yield genetic associations that are extremely inaccurate, unstable, or even meaningless. We
propose a Bayesian framework that allows a unified analysis of unrelated and family-based data to address some of these
central issues, including gene-environment interactions. Joint modeling increases the sample size and therefore promises
improved power. We incorporate gene-gene and gene-environment interactions using our previous model in [1], accounting for
various data types for traits, population stratification, and family structure. We can conveniently model discrete traits using
logistic and multinomial regression models, and continuous traits using normal distributions and mixtures of normal distributions. A
matched case-control design naturally handles the population stratification inherent in population-based samples, while
correlation effects in family-based data are properly modeled. We will develop free software tools for the public research
community. We will test our methods on three Parkinson’s datasets: the Mayo (LEAPs) study, the NINDS PD study, and the
CIDR Familial PD study. Preliminary tests using a binary phenotype indicate that our framework works efficiently. I
have included these plans in an NIH R03 grant proposal, with me as the Principal Investigator. The review returned a favorable
impact score of 29; scores below 30 have a chance of being funded.


4. An evolution-based approach to identify rare disease-risk variants avoids large sample size and low power [25]
We propose a novel model to identify rare risk variants in next-generation sequencing data. Identification of rare disease-risk
variants is one of the central tasks in analyzing sequencing data. Current statistical methods aggregate rare variants
to gain statistical power, which does not necessarily reflect the underlying biological mechanism. As an alternative, we
propose a model that measures conserved patterns such as splice sites instead of mutations such as SNPs. We then combine
the conserved patterns with long-range identity by descent (IBD) mapping segments to identify rare variants. This
assumes that rare variants arose in recent generations and tend to lie on longer IBD segments. Because not all case subjects
carry rare risk variants, owing to their low occurrence, we can formulate the problem on a small subset of subjects that carry the rare
variants of interest. The conserved patterns and IBD mapping segments can be used as genetic units in association studies.
We expect to achieve reasonable power with a relatively small sample size but a higher frequency of risk variants.


5. Performance of statistical methods on CHARGE Targeted Sequencing data [10]
The CHARGE-S (Cohorts for Heart and Aging Research in Genomic Epidemiology-Sequencing) project is a national,
collaborative effort among three studies: the Framingham Heart Study (FHS), the Cardiovascular Health Study (CHS), and the
Atherosclerosis Risk in Communities (ARIC) study. CHARGE-S used a case-cohort design, whereby a random sample of
study participants is enriched with subjects at the extremes of the trait distributions. We evaluate the performance of statistical
methods for detecting rare variants in CHARGE targeted sequencing data under this case-cohort design, and we provide some
guidelines based on their performance.


6. Power analysis of Illumina Omni 5 in detecting novel susceptible variants [9]
Emerging sequencing technologies can characterize virtually all variants across the spectrum of minor allele frequencies
(MAFs), from 0 to 0.5. However, the cost is still rather high. Illumina recently released the Omni 5 genotyping array, with about
5 million assayed SNPs, which balances cost and array density. In this study, we investigate the power of the Omni 5 to detect
susceptibility variants. We evaluate the theoretical power and also the power to detect susceptibility variants in several FHS studies
of femoral neck bone mineral density, lumbar spine bone mineral density, and hippocampal volume. The observations
support the conclusion that arrays with denser markers have greater power to detect susceptibility variants.
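One way to see why marker density matters for theoretical power is a standard normal-approximation calculation for a quantitative trait, in which the signal at a genotyped marker is attenuated by its squared correlation (r2) with the causal variant. This is a textbook approximation offered as a sketch and is not necessarily the calculation used in [9]; the function name `association_power`, the genome-wide threshold of 5e-8, and the example numbers are assumptions.

```python
from statistics import NormalDist


def association_power(n, q2, r2=1.0, alpha=5e-8):
    """Approximate power of a two-sided 1-df association test at a genotyped marker.

    n     : sample size
    q2    : fraction of trait variance explained by the causal variant
    r2    : squared correlation (LD) between the marker and the causal variant
    alpha : significance threshold
    """
    if not (0.0 <= q2 < 1.0 and 0.0 <= r2 <= 1.0):
        raise ValueError("q2 must be in [0, 1) and r2 in [0, 1]")
    explained = r2 * q2                                  # variance captured at the marker
    ncp = (n * explained / (1.0 - explained)) ** 0.5     # expected z-statistic
    z = NormalDist()
    crit = z.inv_cdf(1.0 - alpha / 2.0)
    return z.cdf(ncp - crit) + z.cdf(-ncp - crit)


if __name__ == "__main__":
    for r2 in (0.5, 0.8, 1.0):
        print(f"r2 = {r2:.1f}: power = {association_power(5000, 0.005, r2):.3f}")
```

A denser array tends to contain a marker in stronger LD with any given causal variant, so r2 is closer to 1 and the approximate power is higher for the same sample size.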


• Bioinformatics

7. Nonparametric Bayes ensemble learning to integrate multiple sources for predicting protein-protein interactions and applications to human diseases [2].
The computational prediction of protein-protein interactions (PPIs) is an important task in the post-genomic era. PPIs not only play important roles in fundamental cellular processes but are also a major source of susceptibility to complex human diseases and are crucial to the discovery of new drugs and pharmaceuticals. One of the major difficulties in current PPI prediction is the high false-positive rate, which can exceed 80%. No computational method has been designed to reduce false positives effectively. We propose a novel Bayesian integration method, termed nonparametric Bayes ensemble learning (NBEL), that lowers the misclassification rate by automatically up-weighting the most informative data sources while down-weighting less informative and biased sources. Extensive studies indicate that NBEL is significantly more robust than the classic naïve Bayes and regression models to unreliable, error-prone, and contaminated data. Validation on three high-quality experimental datasets supports these observations. Our method may speed up high-quality PPI prediction. Such reliable predictions may provide a solid platform for other studies, such as the prediction of protein functions and of the roles of PPIs in disease susceptibility. This work is published in PLoS Computational Biology and was reported by GenomeWeb (http://www.genomeweb.com/blog/week-plos-147).

Much work remains to obtain more reliable and dynamic inferences of PPIs. In particular, the different data sources share information, so an effective integration method that relaxes the conditional independence assumption is needed. In addition, because the network of PPIs is essentially time-evolving, we plan an approach that models the PPI network dynamically using time-series modeling. We will test our models using human PPI data from the literature and data that we collect ourselves. A database may be built for newly discovered PPIs. We will work with our collaborators
to validate our new discoveries, and we are interested in using them to detect risks for human diseases. The ability
to predict large numbers of PPIs both reliably and automatically may benefit other studies, such as protein function
prediction and the roles of PPIs in disease susceptibility.

8. Identification of genomic signals using free energy and signal processing approaches [12-16].
Genomic signals define genes and therefore the translated proteins, the basic molecules that regulate human health. We aim to
identify critical genomic signals such as protein-coding genes, splice sites, and pseudogenes. Rather than relying on purely statistical
evaluation, we are dedicated to answering “why”, and we explore further by measuring the underlying interactions between the 3’
tail of 18S rRNA and mRNA using free energy. We discover a period-3 free energy signal in coding regions that is not
found in non-coding regions. We further use this period-3 signal to identify protein-coding sequences by defining
statistical features of the signal. We test on eukaryotic genes of Saccharomyces cerevisiae, Schizosaccharomyces
pombe, fly, mouse, and human. Our experiments indicate improved performance compared to other methods. Tests on
pseudogenes indicate that most pseudogenes have no period-3 signal. We also apply our approach systematically to the
identification of splice sites. Genomic signals discovered with our approach have wide applications and can be used as biomarkers for revealing disease etiology and, in turn, for personalized medicine.
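The idea of scoring a sequence by the strength of its period-3 component can be sketched as follows. The free-energy calculation against the 18S rRNA tail is not reproduced here; as a stand-in, this example applies the Fourier measure at frequency 1/3 to base indicator sequences, which is a different and simpler signal than the one used in [12-16]. The function names and the synthetic codon-position preferences are assumptions chosen only to produce a coding-like test sequence.

```python
import numpy as np


def period3_score(seq):
    """Power at frequency 1/3, summed over the four base indicator sequences,
    divided by the average power over all non-zero DFT frequencies."""
    seq = seq.upper()
    n = len(seq)
    phase = np.exp(-2j * np.pi * np.arange(n) / 3.0)
    peak, total = 0.0, 0.0
    for base in "ACGT":
        u = np.fromiter((ch == base for ch in seq), dtype=float, count=n)
        peak += abs(np.dot(u, phase)) ** 2
        # Parseval: the powers of a 0/1 sequence sum to n * sum(u); drop the DC term.
        total += n * u.sum() - u.sum() ** 2
    if total == 0.0:                                     # empty or single-base sequence
        return 0.0
    return float(peak / (total / (n - 1)))


def synthetic_sequences(length=900, seed=1):
    """Return (coding_like, random_like) test sequences; the preferences are illustrative."""
    rng = np.random.default_rng(seed)
    bases = np.array(list("ACGT"))
    # Hypothetical base preferences (A, C, G, T) at codon positions 1, 2, 3.
    prefs = [[0.1, 0.1, 0.7, 0.1], [0.7, 0.1, 0.1, 0.1], [0.1, 0.7, 0.1, 0.1]]
    coding = "".join(rng.choice(bases, p=prefs[i % 3]) for i in range(length))
    random_like = "".join(rng.choice(bases, size=length))
    return coding, random_like


if __name__ == "__main__":
    coding, random_like = synthetic_sequences()
    print("coding-like score:", period3_score(coding))
    print("random score:     ", period3_score(random_like))
```

A sequence with position-dependent base usage gives a score far above 1, while a sequence without codon structure stays near 1, so a threshold on this score separates the two in this synthetic setting.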

• Collaborative Research
My collaborative research includes studies of the genetics of osteoporosis, stroke, diabetes, Parkinson’s disease, and sickle
cell disease. My collaborators are mostly clinical practitioners, epidemiologists, and geneticists, mainly from Boston
University, Harvard University, and other local universities. The major statistical methods include regression models,
Bayesian statistics, semi-parametric methods, multivariate statistics, and multiple testing, applied to genotype, microarray, and
proteomic data. In addition to statistical consultation, I apply my methods and also develop new methods for my
collaborators. I give only a couple of examples here. One collaborator from Harvard University School of Medicine is
interested in the interaction between fat intake and genes in their effect on bone density derived from quantitative computed tomography. Existing statistical
models for learning genetic interaction networks are restricted by various independence assumptions. The variables for
the genetic interaction networks include genotypes, L3 (third lumbar vertebra) bone mineral density, fat intake variables, environmental factors, and
covariates such as age and weight. I have developed a nonparametric Bayes model for categorical data
[1], but variables such as the trait and weight are continuous [3, 5]. To resolve this issue, I propose a new method that
handles mixed discrete and continuous data. Another collaborator from Boston University School of Medicine is interested
in extracting genetic risk information about sickle cell disease from each of two datasets, genotype and microarray data, and
then combining the results. We apply the method from our recent publication [7], which improves performance through the
differential analysis of sets of genes in microarray data. I also supervise and provide guidance to junior biostatisticians,
including Ph.D. students and analysts.

Please contact me by email if you would like to discuss my research further: chuanhua at bu dot edu.