The human genome contains around 20,000 genes. Pretty much everything that makes humans so special, and everything that can go wrong in disease, relates to the function and regulation of these 20,000 genes. Our lab is interested in what makes these genes tick. How is each gene regulated so as to be expressed when and where it is needed? Which mutations in the genome lead to pathogenic gene activity? Can we fix altered states of gene regulation and gene function to treat human diseases?
Research in the lab is organized into four interconnected themes that tackle the different angles of this overall goal.
DNA double-strand break repair following cleavage by Cas9 is generally considered stochastic, heterogeneous, and impractical for applications beyond gene disruption. Our recent research overturns this dogma, showing that template-free Cas9 nuclease-mediated DNA repair is predictable in human and mouse cells and is capable of precise repair to a predicted genotype in certain sequence contexts, enabling correction of human disease-associated mutations.
To capture Cas9-mediated end-joining repair products across a wide variety of target sequences, we designed a genome-integrated gRNA and target library screen in which 2,000 gRNAs were paired with corresponding target sequences containing a single canonical “NGG” SpCas9 PAM that directs cleavage to the center of each target sequence. Using this wealth of Cas9 outcome data, we trained a novel and highly accurate machine learning model, inDelphi, to predict the spectrum of Cas9-mediated editing products at a given target site at single-base resolution. One of the most striking predictions from inDelphi is that certain sequence contexts predominantly favor specific genotypic outcomes after template-free Cas9-mediated repair. These include contexts with strong local microhomology, which favor a specific deletion outcome, and contexts with weak local microhomology, which favor a specific 1-bp insertion outcome. inDelphi predicts that 5-11% of SpCas9 gRNAs that target human exons and introns are Precision50 gRNAs, which we define as gRNAs predicted to produce a single genotypic outcome in ≥50% of all editing products. The exact fraction depends on cell type, as we have observed that outcome precision varies slightly among cell types.
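The Precision50 definition above can be illustrated with a minimal Python sketch. It assumes we already have, for each gRNA, a predicted distribution over individual genotypic outcomes. The function name, gRNA names, outcome labels, and frequencies below are hypothetical and are not inDelphi's actual interface or output.

```python
from typing import Dict


def is_precision50(outcome_freqs: Dict[str, float], threshold: float = 0.5) -> bool:
    """Return True if the single most frequent predicted genotype makes up
    at least `threshold` of all predicted editing products."""
    total = sum(outcome_freqs.values())
    if total <= 0:
        raise ValueError("outcome frequencies must sum to a positive value")
    return max(outcome_freqs.values()) / total >= threshold


# Hypothetical predicted outcome spectra (genotype label -> predicted fraction).
# Each key must be one specific genotype, not a pooled "other" category.
predicted = {
    "gRNA_A": {"ins_T": 0.62, "del_3bp": 0.20, "del_8bp": 0.18},
    "gRNA_B": {"ins_A": 0.30, "del_7bp": 0.35, "del_2bp": 0.35},
}

# Keep only the gRNAs whose dominant outcome reaches the 50% threshold.
precision50_grnas = [name for name, freqs in predicted.items() if is_precision50(freqs)]
print(precision50_grnas)
```

Here only the first hypothetical gRNA qualifies, because its dominant outcome exceeds half of all predicted products.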
This new concept of precise Cas9 editing outcomes opens new doors both in research and therapeutic applications. We are pursuing the application of new CRISPR/Cas9 screening platforms that leverage our a priori prediction of majority genotypic outcomes of editing. We are also pursuing the use of precise, template-free CRISPR/Cas9 editing for therapeutic gain-of-function disease correction.
Shen MW*, Arbab M*, Hsu JY, Worstell DW, Culbertson SJ, Krabbe O, Cassa CA, Liu DR, Gifford DK, Sherwood RI. Predictable and precise template-free CRISPR editing of pathogenic variants. Nature. epub Nov 7 2018.
Our work on transcription factor binding logic started with a basic question that we believed would hold the key to a priori prediction of transcription factor binding: why do transcription factors bind to only a small fraction (~1-10%) of the genomic instances of their binding motifs? Answering this question using the existing state-of-the-art tool, ChIP-seq, was untenable given that every cell type expresses hundreds of transcription factors, making their ChIP-seq profiling technically impractical. Instead, we sought a single experimental approach that could reveal the binding patterns of hundreds of transcription factors at once. In concert with David Gifford’s computational lab, we found such a technique in DNase-seq, in which each transcription factor leaves a signature “profile,” allowing us to train a machine learning algorithm to predict the binding patterns of over 700 transcription factors from a single experimental dataset at equivalent accuracy to ChIP-seq for each transcription factor.
Armed with this unprecedented dataset on where transcription factors bind, we could examine what was preventing transcription factors from binding to the vast majority of their genomic motifs. We found that most genomic DNA is “closed” by tightly wound histones and is inaccessible to transcription factors. A small set of transcription factors (pioneer factors) are capable of opening chromatin, and all other transcription factors depend on open chromatin to bind. We uncovered a subset of transcription factors (settler factors) that bind DNA in every instance that their motif occurs in open chromatin (see figure on right for a model). Put another way, we could for the first time predict a priori where in the genome certain transcription factors (settlers) would bind simply by knowing their binding motifs and which regions of the genome have open chromatin.
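The settler-factor rule described above (a motif is bound whenever it lies in open chromatin) can be sketched in Python as a simple interval check. This is only an illustration under stated assumptions: the coordinates are made up, intervals are taken to be 0-based and half-open, and a motif is treated as "in open chromatin" only when it is fully contained in an open region. None of these choices are taken from the lab's actual pipeline.

```python
from typing import List, NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive


def predict_settler_sites(motifs: List[Interval],
                          open_regions: List[Interval]) -> List[Interval]:
    """Return the motif instances that fall entirely inside an open-chromatin region."""
    predicted = []
    for m in motifs:
        if any(r.chrom == m.chrom and r.start <= m.start and m.end <= r.end
               for r in open_regions):
            predicted.append(m)
    return predicted


# Hypothetical motif instances and open-chromatin regions.
motif_hits = [
    Interval("chr1", 100, 110),
    Interval("chr1", 5000, 5010),
    Interval("chr2", 300, 310),
]
open_chromatin = [Interval("chr1", 50, 400), Interval("chr2", 1000, 1500)]

bound = predict_settler_sites(motif_hits, open_chromatin)
print(bound)
```

The pairwise scan is fine for a toy example; a genome-scale version would more likely sort the intervals or use an interval tree.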
This work highlights that chromatin accessibility is a major determinant of transcription factor binding; however, these same data indicated that pioneer factors, in spite of having the ability to open previously closed chromatin, still only bind to a fraction of their genomic motifs. So, what is limiting binding of the factors that induce chromatin accessibility? In concert with David Gifford’s lab, we addressed this issue by training a machine learning algorithm, the Synergistic Chromatin Model (SCM), to predict genome-wide chromatin accessibility from DNA sequence alone. Excitingly, the resulting algorithm predicts chromatin accessibility with high accuracy down to the base-pair level.
Yet can this algorithm predict accessibility of novel DNA sequences in a controlled genomic context? To this end, we invented a technique, Single Locus Oligonucleotide Transfer (SLOT), in which a library of thousands of rationally designed DNA phrases is inserted into any genomic locus of interest. SCM accurately predicts the accessibility level of thousands of DNA phrases inserted into several fixed genomic loci using the SLOT assay (see figure on left), providing elegant proof that we have identified a DNA-encoded logic that causes chromatin accessibility. The algorithm’s accuracy is derived from its modeling of chromatin accessibility as a non-specific synergy of chromatin opening by pioneer factors. When pioneer factors bind near other pioneers, they synergize to establish chromatin accessibility, whereas alone they are too weak to have a noticeable effect. Not only does this work propose a coherent hierarchical logic for how transcription factors bind, with cohorts of pioneer factors establishing open chromatin to be occupied by settler factors, it also explains how and why pioneer factors can open chromatin at only a fraction of their genomic motifs.
Our ongoing research into deciphering transcription factor binding logic revolves around predicting the dynamic transcription factor binding that enables cell state change, as this is the key to our ultimate goal of enabling computationally driven cellular reprogramming. Of particular interest is how key signaling pathways such as Wnt enact cell type-specific responses.
Sherwood RI*, Hashimoto T*, O’Donnell CW*, Lewis S, Barkal AA, van Hoff JP, Karun V, Jaakkola T, Gifford DK. Discovery of directional and nondirectional pioneer transcription factors by modeling DNase profile magnitude and shape. Nature Biotechnology. 2014 Feb; 32(2): 171-8.
Barkal AA, Srinivasan S, Hashimoto T, Gifford DK, Sherwood RI. Cas9 Functionally Opens Chromatin. PLOS ONE. 2016; 11(3): e0152683.
Hashimoto T*, Sherwood RI*, Kang DD*, Rajagopal N, Barkal AA, Zeng H, Emons BJM, Srinivasan S, Jaakkola T, Gifford DK. A Synergistic DNA Logic Predicts Genome-wide Chromatin Accessibility. Genome Research. Accepted for publication.
Less than 2% of the human genome encodes genes, and thus most somatic variation occurs in non-coding regions of the genome. It is fast becoming routine to perform whole genome sequencing on patients with genetic disease or cancer, yet our ability to harness genomic data to predict the impact of non-coding genotype on gene expression is rudimentary. We have developed a new technology, the Multiplexed Editing Regulatory Assay (MERA), that experimentally links non-coding genotype with gene expression in high throughput. For a given gene, a library of CRISPR guide RNAs (gRNAs) is constructed to target up to 300 kb of non-coding genomic space surrounding that gene. The gene of interest is labeled with GFP in embryonic stem cells (ESCs), and the gRNA library is added such that each cell has a single focal mutation (on average affecting ~10 bp) in the surrounding genomic space. By detecting gRNAs enriched in GFP-negative cells, we identify which regions in the 300 kb of assayed non-coding genomic space are required for expression of the target gene.
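As a rough illustration of what "gRNAs enriched in GFP-negative cells" could mean in practice, the Python sketch below scores each gRNA by the log2 ratio of its smoothed read frequency in the GFP-negative population to its frequency in the unsorted population. This is a generic enrichment calculation and an assumption on our part, not necessarily the analysis used for MERA. The gRNA names, read counts, and pseudocount are hypothetical.

```python
import math
from typing import Dict


def log2_enrichment(neg_counts: Dict[str, int],
                    bulk_counts: Dict[str, int],
                    pseudocount: float = 1.0) -> Dict[str, float]:
    """Log2 ratio of each gRNA's smoothed read frequency in GFP-negative cells
    to its smoothed read frequency in the unsorted (bulk) population."""
    grnas = sorted(set(neg_counts) | set(bulk_counts))
    if not grnas:
        raise ValueError("no gRNAs provided")
    n = len(grnas)
    # Add the pseudocount to every gRNA so that zero counts stay finite.
    neg_total = sum(neg_counts.values()) + pseudocount * n
    bulk_total = sum(bulk_counts.values()) + pseudocount * n
    scores = {}
    for g in grnas:
        neg_freq = (neg_counts.get(g, 0) + pseudocount) / neg_total
        bulk_freq = (bulk_counts.get(g, 0) + pseudocount) / bulk_total
        scores[g] = math.log2(neg_freq / bulk_freq)
    return scores


# Hypothetical read counts per gRNA in each population.
bulk = {"g_promoter": 100, "g_enhancer": 100, "g_neutral": 100}
gfp_negative = {"g_promoter": 60, "g_enhancer": 35, "g_neutral": 5}

scores = log2_enrichment(gfp_negative, bulk)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

gRNAs with positive scores are over-represented among GFP-negative cells, which would flag their target regions as candidates required for expression of the gene.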
To implement MERA, we first constructed GFP fusion proteins to track the expression of the target genes, using a new protocol we developed to perform gene knock-in with reduced effort and increased throughput compared to the standard procedure. Typical gene knock-in requires 2 weeks to clone a locus-specific homology arm construct and clone a gRNA; our technique uses a self-cloning gRNA and a PCR-amplified short homology arm fragment to achieve efficient locus-specific GFP knock-in with a total of 2 hours of preparation time for all knock-in components (see figure below).
We then developed a strategy to perform gRNA screening with maximal efficiency and minimal preparation time. In this strategy, we use cellular homologous recombination to replace a genomically integrated dummy gRNA with one of the gRNAs from the library (see summary figure below).
This homologous recombination strategy eliminates the laborious two-week-long process of lentiviral gRNA library cloning and, because there is only a single genomically integrated dummy gRNA, guarantees that each cell will only receive a single gRNA from the library.
Using MERA on four embryonic stem cell-specific genes, we found that the gRNAs targeting the expected regions such as the GFP sequence, the gene body, promoter, and known enhancers were the most likely to induce loss of GFP expression, proving that MERA effectively detects genomic regions required for gene expression (see figure below). More interestingly, we also found a number of unexpected non-coding regions required for gene expression. These include the promoters of neighboring genes and a set of regions with no known DNase I hypersensitivity or histone modifications associated with active chromatin (see figure below).
Ultimately, we aim to use MERA to develop a predictive computational model of how non-coding genotype impacts gene expression, which would be a boon for interpreting patient whole genome sequencing data. We are extending MERA to dozens of gene loci, aiming to reveal novel gene regulatory paradigms by identifying patterns across the datasets.
Rajagopal N, Srinivasan S, Kooshesh K, Guo Y, Edwards M, Banerjee B, Syed T, Emons BJM, Gifford DK, Sherwood RI. High-throughput mapping of regulatory DNA. Nature Biotechnology. 2016; 34: 167-174.
Arbab M, Srinivasan S, Hashimoto T, Geijsen N, Sherwood RI. Cloning-Free CRISPR. Stem Cell Reports. 2015 Nov 10; 5(5):908-917.
Cellular reprogramming holds the promise to transform how human disease is approached. However, discovering how to reprogram to cell types of interest is expensive and to date entirely empirical, taking months to years of laborious trial and error. It is our aim to change this. Stay tuned for updates in this area.