The human genome contains around 20,000 genes. While there are an astronomical number of possible combinations of expressed genes, the entire human body is composed of a relatively small fraction of these gene expression combinations, or cell states. Amazingly, we now know that altering the expression of a small number of transcription factors can reprogram cell state. There is immense therapeutic promise in such reprogramming for a myriad of purposes, from modeling disease mechanisms in a dish to performing personalized drug screening to transplanting reprogrammed cells, wildtype or even genetically modified, into patients. This presents an exciting, albeit formidable, challenge—if we crack the code that governs how transcription factors control the finite number of gene expression combinations in the human body, we will be able to predictively control these cell states in the future. Our long-term goal is to create a computational algorithm that accurately predicts the consequences of altering the protein or DNA content of a cell on its gene expression state.

 
Research in the lab is organized into three interconnected themes that tackle the different angles of this overall goal. They are nicely illustrated by the “bullseye” graph below.

Our work on transcription factor binding logic started with a basic question that we believed would hold the key to a priori prediction of transcription factor binding: why do transcription factors bind to only a small fraction (~1-10%) of the genomic instances of their binding motifs? Answering this question with the existing state-of-the-art tool, ChIP-seq, was untenable: every cell type expresses hundreds of transcription factors, so profiling each one by ChIP-seq is technically impractical. Instead, we sought a single experimental approach that could reveal the binding patterns of hundreds of transcription factors at once. In concert with David Gifford’s computational lab, we found such a technique in DNase-seq, in which each transcription factor leaves a signature “profile,” allowing us to train a machine learning algorithm to predict the binding patterns of over 700 transcription factors from a single experimental dataset, with accuracy equivalent to that of ChIP-seq for each transcription factor.

 

Armed with this unprecedented dataset on where transcription factors bind, we could examine what was preventing transcription factors from binding to the vast majority of their genomic motifs. We found that most genomic DNA is “closed” by tightly wound histones and is inaccessible to transcription factors. A small set of transcription factors (pioneer factors) are capable of opening chromatin, and all other transcription factors depend on open chromatin to bind. We uncovered a subset of transcription factors (settler factors) that bind DNA in every instance that their motif occurs in open chromatin (see figure on right for a model). Put another way, we could for the first time predict a priori where in the genome certain transcription factors (settlers) would bind simply by knowing their binding motifs and which regions of the genome have open chromatin.
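The settler rule described above (a motif is bound whenever it lies in open chromatin) can be sketched as a simple interval lookup. This is only an illustrative sketch, not the lab's pipeline: the function name, the half-open coordinate convention, and the toy coordinates are all assumptions made for the example.

```python
from bisect import bisect_right

def predict_settler_sites(motif_sites, open_regions):
    """Return motif sites that lie entirely inside an open-chromatin region.

    motif_sites:  list of (start, end) half-open intervals on one chromosome.
    open_regions: list of non-overlapping (start, end) half-open intervals.
    Both inputs are hypothetical; real data would come from motif scanning
    and an accessibility assay such as DNase-seq.
    """
    regions = sorted(open_regions)
    starts = [start for start, _ in regions]
    bound = []
    for m_start, m_end in motif_sites:
        # Index of the last open region that starts at or before the motif.
        i = bisect_right(starts, m_start) - 1
        if i >= 0 and m_end <= regions[i][1]:
            bound.append((m_start, m_end))
    return bound

# Toy example: four motif instances, two open regions.
motif_sites = [(100, 110), (495, 505), (900, 910), (1500, 1510)]
open_regions = [(50, 200), (850, 1000)]
predicted = predict_settler_sites(motif_sites, open_regions)
print(predicted)
```

In this toy input only the two motif instances inside open regions are returned, mirroring the idea that a settler's binding sites follow from its motif and the accessibility map alone.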

 

This work highlights that chromatin accessibility is a major determinant of transcription factor binding; however, these same data indicated that pioneer factors, despite their ability to open previously closed chromatin, still bind to only a fraction of their genomic motifs. So, what limits the binding of the factors that induce chromatin accessibility? In concert with David Gifford’s lab, we addressed this issue by training a machine learning algorithm, the Synergistic Chromatin Model (SCM), to predict genome-wide chromatin accessibility from DNA sequence alone. Excitingly, the resulting algorithm predicts chromatin accessibility with astounding accuracy, down to the base-pair level.

 

Yet can this algorithm predict the accessibility of novel DNA sequences in a controlled genomic context? To find out, we invented a technique, Single Locus Oligonucleotide Transfer (SLOT), in which a library of thousands of rationally designed DNA phrases is inserted into any genomic locus of interest. SCM accurately predicts the accessibility level of thousands of DNA phrases inserted into several fixed genomic loci using the SLOT assay (see figure on left), providing elegant proof that we have identified a DNA-encoded logic that causes chromatin accessibility. The algorithm’s accuracy derives from its modeling of chromatin accessibility as a non-specific synergy of chromatin opening by pioneer factors. When pioneer factors bind near other pioneers, they synergize to establish chromatin accessibility, whereas alone they are too weak to have a noticeable effect. Not only does this work propose a coherent hierarchical logic for how transcription factors bind, with cohorts of pioneer factors establishing open chromatin to be occupied by settler factors, it also explains how and why pioneer factors can open chromatin at only a fraction of their genomic motifs.
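To make the synergy idea concrete, here is a toy illustration in which nearby pioneer sites contribute additively, regardless of which pioneer they belong to, and the sum passes through a steep logistic function. This is not the SCM itself; the functional form, the window size, the threshold, and the steepness are all assumptions chosen only to show how a lone pioneer can have a negligible effect while a cluster opens chromatin.

```python
import math

def accessibility(pioneer_positions, strengths, locus,
                  window=100, threshold=2.0, steepness=4.0):
    """Toy accessibility score in (0, 1) at `locus`.

    Sums the strengths of all pioneer sites within `window` bp of the locus
    and applies a logistic function. All parameters are illustrative
    assumptions, not values taken from the SCM.
    """
    total = sum(s for p, s in zip(pioneer_positions, strengths)
                if abs(p - locus) <= window)
    return 1.0 / (1.0 + math.exp(-steepness * (total - threshold)))

# One pioneer site alone versus three pioneer sites clustered together.
lone = accessibility([1000], [1.0], locus=1000)
cluster = accessibility([960, 1000, 1040], [1.0, 1.0, 1.0], locus=1000)
# Three sites that are too far apart to cooperate.
scattered = accessibility([500, 1000, 1500], [1.0, 1.0, 1.0], locus=1000)
print(round(lone, 3), round(cluster, 3), round(scattered, 3))
```

Under these assumed parameters the lone and scattered cases score close to 0 and the clustered case close to 1, which is the qualitative behavior the paragraph above describes: the same motif opens chromatin only where it has pioneer neighbors.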

Our ongoing research into deciphering transcription factor binding logic revolves around predicting the dynamic transcription factor binding that enables cell state change, as this is the key to our ultimate goal of enabling computationally driven cellular reprogramming. Of particular interest is how key signaling pathways such as Wnt enact cell type-specific responses.

Key publications:

Sherwood RI*, Hashimoto T*, O’Donnell CW*, Lewis S, Barkal AA, van Hoff JP, Karun V, Jaakkola T, Gifford DK. Discovery of directional and nondirectional pioneer transcription factors by modeling DNase profile magnitude and shape. Nature Biotechnology. 2014 Feb; 32(2): 171-8.

Barkal AA, Srinivasan S, Hashimoto T, Gifford DK, Sherwood RI. Cas9 Functionally Opens Chromatin. PLOS ONE. 2016; 11(3): e0152683.

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4816323/

Hashimoto T*, Sherwood RI*, Kang DD*, Rajagopal N, Barkal AA, Zeng H, Emons BJM, Srinivasan S, Jaakkola T, Gifford DK. A Synergistic DNA Logic Predicts Genome-wide Chromatin Accessibility. Genome Research. Accepted for publication.

Less than 2% of the human genome encodes genes, and thus most somatic variation occurs in non-coding regions of the genome. It is fast becoming routine to perform whole genome sequencing on patients with genetic disease or cancer, yet our ability to harness genomic data to predict the impact of non-coding genotype on gene expression is rudimentary. We have developed a new technology, the Multiplexed Editing Regulatory Assay (MERA), that experimentally links non-coding genotype with gene expression in high throughput. For a given gene, a library of CRISPR guide RNAs (gRNAs) is constructed to target up to 300 kb of non-coding genomic space surrounding that gene. The gene of interest is labeled with GFP in embryonic stem cells (ESCs), and the gRNA library is added such that each cell has a single focal mutation (on average affecting ~10 bp) in the surrounding genomic space. By detecting gRNAs enriched in GFP-negative cells, we identify which regions in the 300 kb of assayed non-coding genomic space are required for expression of the target gene.
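The enrichment step can be sketched as a per-gRNA log ratio of read fractions. This is a minimal sketch under stated assumptions, not the published MERA analysis: it assumes read counts are available for the GFP-negative population and for an unsorted reference pool, and the function name, the pseudocount, the cutoff of 1.0, and the gRNA names are hypothetical.

```python
import math

def log2_enrichment(gfp_neg_counts, bulk_counts, pseudocount=1.0):
    """Per-gRNA log2 ratio of read fraction in GFP-negative cells
    versus an unsorted reference pool (the reference is an assumption).

    A pseudocount is added to every gRNA so that zero counts do not
    produce division by zero or log of zero.
    """
    guides = set(gfp_neg_counts) | set(bulk_counts)
    neg_total = sum(gfp_neg_counts.values()) + pseudocount * len(guides)
    bulk_total = sum(bulk_counts.values()) + pseudocount * len(guides)
    scores = {}
    for g in guides:
        neg_frac = (gfp_neg_counts.get(g, 0) + pseudocount) / neg_total
        bulk_frac = (bulk_counts.get(g, 0) + pseudocount) / bulk_total
        scores[g] = math.log2(neg_frac / bulk_frac)
    return scores

# Toy counts: one gRNA disrupting a required region, two neutral gRNAs.
bulk = {"g_enhancer": 100, "g_neutral_1": 100, "g_neutral_2": 100}
gfp_neg = {"g_enhancer": 300, "g_neutral_1": 10, "g_neutral_2": 12}
scores = log2_enrichment(gfp_neg, bulk)
hits = sorted(g for g, s in scores.items() if s > 1.0)
print(hits)
```

With these toy counts only the gRNA that dominates the GFP-negative population clears the cutoff. A real analysis would also need replicates and a statistical test rather than a fixed threshold.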

To implement MERA, we first constructed GFP fusion proteins to track the expression of the target genes, using a new protocol we developed to perform gene knock-in with less effort and higher throughput than the standard procedure. Typical gene knock-in requires 2 weeks to clone a locus-specific homology arm construct and a gRNA; our technique uses a self-cloning gRNA and a PCR-amplified short homology arm fragment to achieve efficient locus-specific GFP knock-in with a total of 2 hours of preparation time for all knock-in components (see figure below).

We then developed an elegant strategy to perform gRNA screening with maximal efficiency and minimal preparation time. In this strategy, we use cellular homologous recombination to replace a genomically integrated dummy gRNA with one of the gRNAs from the library (see summary figure below).

This homologous recombination strategy eliminates the laborious two-week-long process of lentiviral gRNA library cloning and, because there is only a single genomically integrated dummy gRNA, guarantees that each cell will only receive a single gRNA from the library.

 

Using MERA on four embryonic stem cell-specific genes, we found that the gRNAs targeting the expected regions such as the GFP sequence, the gene body, promoter, and known enhancers were the most likely to induce loss of GFP expression, proving that MERA effectively detects genomic regions required for gene expression (see figure below). More interestingly, we also found a number of unexpected non-coding regions required for gene expression. These include the promoters of neighboring genes and a set of regions with no known DNase I hypersensitivity or histone modifications associated with active chromatin (see figure below).

Ultimately, we aim to use MERA to develop a predictive computational model of how non-coding genotype impacts gene expression, which would be a boon for interpreting patient whole genome sequencing data. We are extending MERA to dozens of gene loci, aiming to reveal novel gene regulatory paradigms by identifying patterns across the datasets.

Key publications:

Rajagopal N, Srinivasan S, Kooshesh K, Guo Y, Edwards M, Banerjee B, Syed T, Emons BJM, Gifford DK, Sherwood RI. High-throughput mapping of regulatory DNA. Nature Biotechnology. 2016; 34(2): 167-174.

http://www.nature.com/nbt/journal/v34/n2/full/nbt.3468.html

Arbab M, Srinivasan S, Hashimoto T, Geijsen N, Sherwood RI. Cloning-Free CRISPR. Stem Cell Reports. 2015 Nov 10; 5(5):908-917.

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4649464/

Cellular reprogramming holds the promise to transform how human disease is approached. However, discovering how to reprogram to cell types of interest is expensive and, to date, entirely empirical, taking months to years of laborious trial and error. We aim to change this. Stay tuned for updates in this area.