Note: the references cited in this statement are listed in
my curriculum vitae.
As Steve Jobs said,
"I think the biggest innovations of the 21st century will be at the
intersection of biology and technology. A new era is beginning."
Identifying human disease genes is one of the major tasks of
modern genomics. Current research has revealed thousands of disease risk
variants and many signaling pathways, greatly improving our understanding of
the role of genetics in human disease. However, a large portion of heritability
remains unexplained. Powerful statistical methods are needed to close
the gap between current discoveries and the goals of pinpointing causal genes,
dissecting the underlying disease etiology, and eventually enabling translational
research, clinical practice, and personalized medicine.
My research focuses on statistical methods and their applications to genetic
epidemiology and bioinformatics. The goal is to use my statistical
skills and my background in genetics to discover novel
genetic risk factors for human diseases. I am dedicated to narrowing the
gap between current genetic discoveries and future research. My research topics
cover multiple challenging issues, including (1) rare variants, (2) multiple
phenotypes, (3) integrative analysis via genetic interaction networks,
(4) joint analysis of familial and unrelated data to enlarge sample size,
(5) identification of genomic signals as biomarkers, and (6) other susceptibility
sources such as proteomics.
Statistical genetics
1. Semi-parametric models for multiple-phenotype analysis
in genetic association studies [6]
Studying multiple phenotypes in the same set of subjects makes the most of
the association mapping between genotypes and phenotypes, and it is also
economical. However, a case-control sample is not a random sample from the
general population. If an analysis ignores the case-control sampling, the
estimated association between a genetic variant and a secondary phenotype
can be biased. We propose a general approach for estimating and testing the
population effect of a genetic variant on a secondary phenotype that corrects
this bias. Our approach is based on inverse-weighted estimating equations,
with weights derived from the conditional probability of an individual being
a case or a control. Compared with a likelihood-based analysis, our approach
is substantially more robust to model misspecification and performs better
in terms of both validity and power. It is also much more computationally
efficient. These advantages make our approach a practical tool for
genome-wide genetic association studies.
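The weighting idea can be illustrated with a minimal sketch. It is not the published method in [6]: it assumes a single variant, a continuous secondary phenotype, and known sampling probabilities for cases and controls, and the function and parameter names are hypothetical. Each subject is weighted by the inverse of the probability of being sampled given case status, and the weighted least-squares slope solves the corresponding estimating equation.

```python
def ipw_slope(genotype, phenotype, is_case, p_sample_case, p_sample_control):
    """Inverse-probability-weighted slope of a secondary phenotype on genotype.

    Illustrative assumption: the sampling probabilities for cases and
    controls are known constants (hypothetical parameters).
    """
    # Weight = 1 / P(sampled | case status); under-sampled groups count more.
    weights = [1.0 / (p_sample_case if d else p_sample_control) for d in is_case]
    w_sum = sum(weights)

    # Weighted means of genotype and phenotype.
    g_bar = sum(w * g for w, g in zip(weights, genotype)) / w_sum
    y_bar = sum(w * y for w, y in zip(weights, phenotype)) / w_sum

    # Solving sum_i w_i * (1, g_i)^T * (y_i - a - b * g_i) = 0 for b gives
    # the weighted covariance divided by the weighted variance.
    s_gy = sum(w * (g - g_bar) * (y - y_bar)
               for w, g, y in zip(weights, genotype, phenotype))
    s_gg = sum(w * (g - g_bar) ** 2 for w, g in zip(weights, genotype))
    return s_gy / s_gg


# Toy data: all cases sampled, half of the controls sampled.
slope = ipw_slope(
    genotype=[0, 1, 0, 1],
    phenotype=[0.0, 1.0, 0.0, 3.0],
    is_case=[False, False, True, True],
    p_sample_case=1.0,
    p_sample_control=0.5,
)
```

In the toy data the unweighted slope would be 2, whereas the weighted slope is pulled toward the controls, who represent more of the underlying population.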
Next, we build on our previous studies and extend them to the analysis of multiple
phenotypes using sequencing data. We continue to use inverse-weighted
estimating equations, with weights derived from the conditional probability of
an individual's primary-trait status. However, when
the primary trait is continuous, we need to introduce a latent binary variable
and connect it to the continuous primary trait. Unbiasedness holds provided that
model consistency and identifiability are carefully examined. Such a design
matters in practice. For example, in the CHARGE study, the hundreds of phenotypes
other than those selected for sequencing cannot be analyzed validly if we
ignore the sampling on the selected traits, because of such biases. Re-sequencing
for hundreds of phenotypes is also not financially feasible. In addition, our
design accounts for missing data, family structure, rare variants, meta-analysis,
and pleiotropic and polygenic analysis. We will also consider the influence
of different study designs, especially the case-cohort design used for CHARGE
sequencing. We will test our models using CHARGE sequencing data. User-friendly
software will be released to the public research community.
2. Nonparametric Bayes modeling of genetic interaction
networks for integratively identifying sets of genetic risks of human
diseases [1, 3, 5, 22, 24]
A better understanding of complex human diseases relies on better dissection
of the underlying genetic interaction networks, taking the various types of
interaction into account. Modeling interaction networks for high-dimensional
data is challenging, particularly when one wishes to avoid strong assumptions
about the dependence structure. We develop a nonparametric Bayes approach
that defines a prior with full support on the space of distributions. This
support condition ensures that we do not restrict the dependence structure
a priori. We show that this can be accomplished through a Dirichlet process
mixture of product multinomial distributions, which is also a convenient
form for posterior computation.
We have extended this work into a series of explorations for integratively
identifying sets of disease risk factors among genotypes,
environmental factors, disease traits, and all other relevant risk factors.
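For readers unfamiliar with this model class, a Dirichlet process mixture of product multinomial distributions can be written in stick-breaking form as sketched here. The notation is assumed for illustration and is not taken from [1]: x_{ij} is the category of variable j for subject i, p is the number of variables, d_j is the number of categories of variable j, and alpha and a_{jc} are hyperparameters.

```latex
% Notation assumed for illustration; see the lead-in for the symbols.
\begin{align*}
\Pr(x_{i1}=c_1,\dots,x_{ip}=c_p)
  &= \sum_{h=1}^{\infty} \nu_h \prod_{j=1}^{p} \psi^{(j)}_{h c_j}, \\
\nu_h &= V_h \prod_{l<h} (1 - V_l), \qquad V_h \sim \mathrm{Beta}(1,\alpha), \\
\psi^{(j)}_h = \bigl(\psi^{(j)}_{h1},\dots,\psi^{(j)}_{h d_j}\bigr)
  &\sim \mathrm{Dirichlet}(a_{j1},\dots,a_{j d_j}).
\end{align*}
```

Within each mixture component h the variables are independent, so any dependence among them arises only from mixing over components, which is why no dependence structure has to be fixed in advance.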
Next, we aim to integratively identify sets of disease risk factors,
including genotypes, environmental factors, fat-intake variables, disease
traits, and all other relevant factors. We extend our model in [1] to handle
mixed discrete and continuous data, family structure, sequencing
data, prior incorporation, and efficient posterior calculation. Compared
with previous models, our model produces genetic interaction networks without
restrictive assumptions while guarding against false positives. We incorporate
discrete and continuous data through a Dirichlet process mixture of product
multinomial distributions that models the joint distribution of all variables.
The genetic interaction network can then be learned from the joint distribution.
Through a latent-variable labeling process, our model can capture the
genetic association between the trait and the variables of interest. We
incorporate family structure via a genetic random effect variable. We
handle sequencing data using a SKAT-based method and/or the approach.
We apply our models to explore a network of genes that might interact
with dietary fat intake variables to influence the trabecular bone density
of the L3 vertebra, measured using quantitative computed tomography. I have
written this plan into an NIH R01
grant proposal.
3. Unified analyses of case-control & familial data
using a novel Bayesian framework [4]
Several central issues continue to hamper efforts to improve the power
of mapping trait loci in complex human disease.
Inadequate power can yield genetic associations that are extremely
inaccurate, unstable, and even meaningless. We
proposed a Bayesian framework that allows a unified analysis of unrelated
and family-based data to address some of these
central issues, including gene-environment interactions. Joint modeling
increases the sample size and therefore promises
improved power. We incorporate gene-gene and gene-environment interactions
using our previous model in [1], accounting for
various data types for traits, population stratification, and family structure.
We can conveniently model discrete traits using
logistic and multinomial regression models, and continuous traits using
normal distributions and mixtures of normal distributions. A
matched case-control design naturally handles the population stratification
inherent in population-based samples, while
correlation effects in family-based data are properly modeled. We will
develop free software tools for the public research
community. We will test our methods on three Parkinson's disease datasets:
the Mayo (LEAPs) study, the NINDS PD study, and the
CIDR Familial PD study. Preliminary tests using a binary phenotype
indicate that our framework works efficiently. I
have written these plans into an NIH R03 grant proposal with myself as the Principal
Investigator. The proposal was reviewed favorably, receiving
an impact score of 29, where scores below 30 have a chance of being funded.
4. An evolution-based approach to identifying rare disease-risk
variants that avoids large sample sizes and low power [25]
We propose a novel model to identify rare risk variants in next-generation
sequencing data. Identifying rare disease-risk
variants is one of the central tasks in analyzing sequencing data. Current
statistical methods aggregate rare variants
to gain statistical power, which does not necessarily reflect the underlying
biological mechanism. As an alternative, we
propose a model that measures conserved patterns such as splice sites
instead of mutations such as SNPs. We then combine
the conserved patterns with long-range identity by descent (IBD) mapping
segments to identify rare variants. This
assumes that rare variants arose in recent generations and therefore tend
to lie on longer IBD segments. Because rare risk variants
occur infrequently, not all case subjects carry them, so we can formulate the problem
on the small subset of subjects that carry the rare
variants of interest. The conserved patterns and IBD mapping segments
can be used as genetic units in association studies.
We expect to achieve reasonable power with a relatively small sample
in which the risk variants have a higher frequency.
5. Performance of statistical methods on CHARGE Targeted
Sequencing data [10]
The CHARGE-S (Cohorts for Heart and Aging Research in Genomic Epidemiology-Sequencing)
project is a national,
collaborative effort of three studies: the Framingham Heart Study (FHS), the
Cardiovascular Health Study (CHS), and
Atherosclerosis Risk in Communities (ARIC). CHARGE-S used a case-cohort
design, in which a random sample of
study participants is enriched with subjects at the extremes of the traits. We
evaluate the performance of statistical methods for detecting rare variants
in CHARGE targeted sequencing data under this case-cohort design, and we
provide guidelines based on that performance.
6. Power analysis of Illumina Omni 5 in detecting novel
susceptibility variants [9]
Emerging sequencing technologies can characterize virtually all variants
across the spectrum of minor allele frequencies
(MAFs), from 0 to 0.5. However, the cost is still rather high.
Illumina recently released the Omni 5 genotyping array, with
~5 million assayed SNPs, which balances cost and marker density. In this
study, we investigate the power of Omni 5 to detect
susceptibility variants. We evaluate the theoretical power and also the power
to detect susceptibility variants in several FHS studies:
femoral neck bone mineral density, lumbar spine bone mineral density,
and hippocampal volume. The observations
support that arrays with denser markers have greater power to detect
susceptibility variants.
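One standard way to see why denser arrays can gain power is a textbook approximation for single-variant tests on a quantitative trait. It is not the calculation used in [9]. It assumes an additive causal variant explaining a fraction var_explained of the trait variance and tagged by an array SNP at linkage disequilibrium r2, so that a 1-df test has noncentrality roughly n * R2 / (1 - R2) with R2 = r2 * var_explained. All names and default values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def association_power(n, var_explained, r2, alpha=5e-8):
    """Approximate power of a two-sided 1-df association test.

    n             -- sample size
    var_explained -- trait variance explained by the causal variant
    r2            -- LD (squared correlation) between causal and tag SNP
    alpha         -- significance threshold (assumed genome-wide default)
    """
    std_normal = NormalDist()
    # Variance explained at the tag SNP is attenuated by r2.
    tagged = r2 * var_explained
    # Noncentrality parameter of the 1-df chi-square test.
    ncp = n * tagged / (1.0 - tagged)
    z_crit = std_normal.inv_cdf(1.0 - alpha / 2.0)
    shift = sqrt(ncp)
    # Probability that the normal test statistic exceeds either threshold.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


# Same causal variant, tagged well (dense array) or poorly (sparse array).
power_dense = association_power(n=5000, var_explained=0.01, r2=0.9)
power_sparse = association_power(n=5000, var_explained=0.01, r2=0.5)
```

Under these assumptions a denser array helps only through a higher r2 between the causal variant and its best tag, which is the quantity to vary when comparing arrays.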
Bioinformatics
7. Nonparametric Bayes ensemble learning to integrate
multiple sources for predicting protein-protein interactions and applications
to human diseases [2]
The computational prediction of protein-protein interactions (PPIs) is
an important task in the post-genomic era. PPIs not only play important roles
in fundamental cellular processes but are also a major source of susceptibility
to complex human diseases and are crucial in the discovery of new drugs.
One of the major difficulties in current PPI prediction is the high
false-positive rate, which can exceed 80%. No existing computational method
was designed specifically to reduce false positives. We propose a novel
Bayesian integration method, termed nonparametric Bayes ensemble learning
(NBEL), that lowers the misclassification rate by automatically up-weighting
the most informative data sources while down-weighting less informative
and biased sources. Extensive studies indicate that NBEL is significantly
more robust than classic naïve Bayes and regression models to
unreliable, error-prone, and contaminated data. Validation on three
high-quality experimental datasets supports our observations. Our
method may speed up high-quality PPI prediction. Such reliable
predictions may provide a solid platform for other studies, such as prediction
of protein functions and of the roles of PPIs in disease susceptibility. This
work was published in PLoS Computational Biology and was reported by GenomeWeb
(http://www.genomeweb.com/blog/week-plos-147).
Much work remains to obtain more reliable and dynamic inferences
of PPIs. In particular, the different data sources share information,
so an effective integration method that relaxes the conditional
independence restriction is needed. In addition, because
the PPI network is essentially time-evolving, we plan an approach that models
the network dynamically using time series modeling.
We will test our models using human PPI data from the literature and
data that we collect ourselves. A database may be built
for newly discovered PPIs. We will work with our collaborators
to validate our new discoveries, and we are interested in examining
their contribution to human disease risk. The ability
to predict large numbers of PPIs both reliably and automatically may
accelerate high-quality PPI discovery, and the reliable
predictions from our method may benefit other studies, such as prediction of
protein functions and of the roles of PPIs in disease
susceptibility.
8. Identification of genomic signals using free energy
and signal processing approaches [12-16]
Genomic signals define genes and hence the translated proteins, which are
basic molecules regulating human health. We aim to
identify critical genomic signals such as protein-coding genes, splice
sites, and pseudogenes. Rather than relying on purely statistical
evaluation, we seek to explain why these signals arise, and we explore
further by measuring the underlying interactions between the 3'
tail of the 18S rRNA and mRNA using free energy. We discover a period-3 free
energy signal in coding regions that is not
found in non-coding regions. We then use this period-3 signal to identify
protein-coding sequences by defining
statistical features of the signal. We test on eukaryotic genes
from Saccharomyces cerevisiae, Schizosaccharomyces
pombe, fly, mouse, and human. Our experiments indicate improved performance
compared with other methods. Tests on
pseudogenes indicate that most pseudogenes have no period-3 signal. We
also apply our approach systematically to the identification of
splice sites. Genomic signals discovered with our approach
have wide applications and can be used as biomarkers for revealing disease
etiology and, in turn, for personalized medicine.
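A period-3 component in a numeric profile along a sequence can be measured with a discrete Fourier transform evaluated at frequency 1/3. This sketch is only an illustration of that signal-processing step, not the feature definition used in [12-16]: it assumes a per-position profile (for example, free energy values) has already been computed, and the function name and normalization are hypothetical.

```python
import cmath


def period3_strength(profile):
    """Share of a profile's variance carried by its period-3 component.

    Returns about 1.0 for a pure period-3 profile whose length is a multiple
    of 3, and 0.0 for a constant profile. For other lengths the value is only
    an approximate indicator.
    """
    n = len(profile)
    mean = sum(profile) / n
    centered = [x - mean for x in profile]

    energy = sum(x * x for x in centered)
    if energy == 0.0:
        return 0.0  # constant profile: no periodic component at all

    # Fourier coefficient at frequency 1/3 (one cycle every three positions).
    coeff = sum(x * cmath.exp(-2j * cmath.pi * t / 3.0)
                for t, x in enumerate(centered))

    # For a real profile the period-3 energy is split equally between
    # frequencies 1/3 and 2/3, hence the factor of 2.
    return 2.0 * abs(coeff) ** 2 / (n * energy)


# Hypothetical profiles: one repeating every 3 positions, one every 2.
coding_like = [2.0, -1.0, -1.0] * 10
noncoding_like = [1.0, -1.0] * 15
strength_coding = period3_strength(coding_like)
strength_noncoding = period3_strength(noncoding_like)
```

In practice such a statistic would be computed in sliding windows and compared between candidate coding and non-coding regions.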
Collaborative Research
My collaborative research includes studies of the genetics of osteoporosis,
stroke, diabetes, Parkinson's disease, and sickle
cell disease. My collaborators are mostly clinical practitioners, epidemiologists,
and geneticists, mainly from Boston
University, Harvard University, and other local universities. The major
statistical methods include regression models,
Bayesian statistics, semi-parametric methods, multivariate statistics,
and multiple testing, applied to genotype, microarray, and
proteomic data. In addition to statistical consultation, I apply my methods
and also develop new methods for my
collaborators. I give only a couple of examples here. One collaborator from Harvard
University School of Medicine is
interested in the interaction between fat intake and genes in relation to
bone density derived from quantitative computed tomography. Existing statistical
models for learning genetic interaction networks are restricted by various
independence assumptions. The variables in these
genetic interaction networks include genotypes, lumbar-3 (L3) bone mineral density,
fat intake variables, environmental factors, and
covariates such as age and weight. I have developed a model using
a nonparametric Bayes method for categorical data
[1], but variables such as the trait and weight are continuous [3, 5].
To resolve this issue, I propose a new method that
handles mixed discrete and continuous data. Another collaborator from
Boston University School of Medicine is interested
in extracting genetic risk information about sickle cell disease from each
of two datasets, genotypes and microarray data, and
then combining them. We apply our recent publication [7], which
provides improved performance based on the
differential analysis of gene sets for microarray data. I also supervise
and provide guidance to junior biostatisticians,
including Ph.D. students and analysts.
Please
contact me by email if you would like to discuss my research further:
chuanhua at bu dot edu.