# Predicting Genetic Variants Affecting Gene Regulation

A large fraction of the genetic variants associated with complex traits lie outside protein-coding genes and likely affect gene regulation. Large experimental efforts have been dedicated to mapping regulatory regions in the genome, but few systematic methods integrate functional data and regulatory sequences to predict the potential effect of any genetic variant in any given tissue and on any given motif. As a result, we are still missing a systematic definition of functional and silent genetic variants in non-coding regions.

In (Moyerbrailean et al., PLoS Genetics), we developed a novel approach that integrates sequence information and DNase I footprinting data to predict the impact of a sequence change on transcription factor (TF) binding. Our approach extends the CENTIPEDE model and uses footprint information to iteratively improve the motif model (Figure 1).

Figure 1: CentiSNPs annotation. (A) Data sources. (B) Iterative process of using CENTIPEDE and seed sequence models (bottom left) to call footprints (top), then to revise the sequence models (bottom right), and call footprints again. (C) Computational predictions of genetic variant impact on factor binding. Conditional on a motif sequence match and an observed DNase-seq footprint, a prediction is made using CENTIPEDE’s logistic model for the prior probability of binding for each allele: $p_H$ for the high binding allele (upward triangle), and $p_L$ for the lower binding allele (downward triangle). (D) SNPs in non-coding regions are successively classified into nested categories based on being in a DNase I hypersensitive site (DHS), being in a CENTIPEDE footprint, and having a predicted functional impact on binding (based on $p_H$ and $p_L$).
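To make panel (C) concrete, the following is a minimal sketch of how a logistic prior could turn a motif match score into a per-allele binding probability. It is an illustration only, not the CENTIPEDE implementation: the toy position weight matrix, the function names, and the coefficients `beta0` and `beta1` are all hypothetical, and the prior here depends on the motif score alone.

```python
import math

# Hypothetical 4-bp position weight matrix (base probabilities per position).
TOY_PWM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]


def pwm_score(seq, pwm, background=0.25):
    """Log2-odds score of a sequence against a PWM, with a uniform background."""
    return sum(math.log2(pwm[i][base] / background) for i, base in enumerate(seq))


def binding_prior(score, beta0, beta1):
    """Logistic prior probability of binding as a function of the motif score."""
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * score)))


def allele_priors(ref_seq, alt_seq, pwm, beta0, beta1):
    """Return (p_H, p_L): priors for the higher- and lower-binding allele."""
    p_ref = binding_prior(pwm_score(ref_seq, pwm), beta0, beta1)
    p_alt = binding_prior(pwm_score(alt_seq, pwm), beta0, beta1)
    return max(p_ref, p_alt), min(p_ref, p_alt)


# Assumed coefficients; in practice they would be estimated per motif and sample.
p_high, p_low = allele_priors("ACGT", "ACTT", TOY_PWM, beta0=-4.0, beta1=1.0)
print(f"p_H = {p_high:.3f}, p_L = {p_low:.3f}")
```

A variant whose two alleles give clearly different values of $p_H$ and $p_L$ would be a candidate for the "predicted functional impact" category in panel (D); the threshold used to make that call is not specified here.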

Applying this approach to 653 DNase-seq samples from the ENCODE and Roadmap Epigenomics projects, we predicted the impact of a genetic variant on TF binding across 153 tissues and 1,372 TF motifs. Each annotation we derived is specific to a cell type, condition, or assay and is locally motif-driven. We found 5.8 million genetic variants in footprints, 66% of which are predicted by our model to affect TF binding. Comprehensive examination using allele-specific hypersensitivity (ASH) reveals that only the latter group consistently shows evidence for ASH (3,217 SNPs at 20% FDR), suggesting that most (97%) genetic variants in footprinted regulatory regions are indeed silent.

Annotating those that are not silent allows us to investigate the molecular basis for the genetic architecture of many common traits and also to study the evolutionary properties that different types of regulatory sequences have across tissues or transcription factors. Overall, our study supports the concept that polygenic variation in binding sites for distinct classes of transcription factors has been a major target of evolutionary forces contributing to disease risk and complex trait variation in humans.

Combining this information with GWAS data reveals that our annotation helps in computationally fine-mapping 86 SNPs in GWAS hit regions with at least a 2-fold increase in the posterior odds of picking the causal SNP. The rich meta information provided by the tissue-specificity and the identity of the putative TF binding site being affected also helps in identifying the underlying mechanism supporting the association. As an example, the enrichment for LDL level-associated SNPs is 9.1-fold higher among SNPs predicted to affect HNF4 binding sites than in a background model already including tissue-specific annotation.
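The effect of an annotation on fine-mapping can be illustrated with a small sketch. It assumes a single causal SNP per region, made-up per-SNP Bayes factors, and a hypothetical prior enrichment weight for annotated SNPs; it is not the fine-mapping procedure used in the study, only the general idea of an annotation-informed prior.

```python
def posterior_probs(bayes_factors, prior_weights):
    """Posterior probability that each SNP is the causal one, assuming exactly
    one causal SNP in the region and priors proportional to prior_weights."""
    weighted = [bf * w for bf, w in zip(bayes_factors, prior_weights)]
    total = sum(weighted)
    return [x / total for x in weighted]


def odds(p):
    """Convert a probability into odds."""
    return p / (1.0 - p)


# Made-up association evidence for four SNPs in one GWAS hit region.
bayes_factors = [50.0, 40.0, 5.0, 1.0]
# Hypothetical annotation: only the second SNP is predicted to affect binding.
annotated = [False, True, False, False]
enrichment = 9.0  # assumed prior weight for annotated SNPs relative to others

uniform = posterior_probs(bayes_factors, [1.0] * len(bayes_factors))
informed = posterior_probs(
    bayes_factors, [enrichment if a else 1.0 for a in annotated]
)

# Fold change in the posterior odds of the annotated SNP due to the annotation.
odds_ratio = odds(informed[1]) / odds(uniform[1])
print(f"posterior: {uniform[1]:.3f} -> {informed[1]:.3f}, odds x{odds_ratio:.1f}")
```

In this toy region the annotated SNP is not the one with the strongest association signal, yet the informed prior makes it the top candidate, which is the kind of shift in posterior odds described above.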

Similarly, we used this annotation for fine-mapping eQTLs by re-analyzing the GEUVADIS dataset using a new meta-analysis method that can leverage multiple populations (Wen et al., 2014, PLoS Genetics). A large fraction of genetic variants in footprints do not seem to have an impact on gene expression, but those that are predicted to have an impact on binding are more likely to affect gene expression. We are currently using data from GTEx and from close collaborators to develop a deeper understanding of the regulatory grammar of gene regulatory sequences.