The human genome contains around 20,000 genes. Pretty much everything that makes humans so special, and everything that can go wrong in disease, relates to the function and regulation of these 20,000 genes. Our lab is interested in what makes these genes tick. How is each gene regulated so as to be expressed when and where it is needed? Which mutations in the genome lead to pathogenic gene activity? Can we fix altered states of gene regulation and gene function to treat human diseases?
Research in the lab revolves around developing new high-throughput screening platforms and machine learning algorithms that tackle different angles of this overall goal. Below are some vignettes from previously published projects. We are constantly developing new technologies to answer these fundamental questions, so these represent but a few of the many exciting ongoing projects in the lab.
DNA double-strand break repair following cleavage by Cas9 is generally considered stochastic, heterogeneous, and impractical for applications beyond gene disruption. Our recent research overturns this dogma, showing that template-free Cas9 nuclease-mediated DNA repair is predictable in human and mouse cells and is capable of precise repair to a predicted genotype in certain sequence contexts, enabling correction of human disease-associated mutations.
To capture Cas9-mediated end-joining repair products across a wide variety of target sequences, we designed a genome-integrated gRNA and target library screen in which 2,000 gRNAs were paired with corresponding target sequences, each containing a single canonical “NGG” SpCas9 PAM that directs cleavage to the center of the target sequence. Using this wealth of Cas9 outcome data, we trained a highly accurate machine learning model, inDelphi, to predict the spectrum of Cas9-mediated editing products at a given target site at single-base resolution. One of the most striking predictions from inDelphi is that certain sequence contexts predominantly yield a single genotypic outcome after template-free Cas9-mediated repair: strong local microhomology favors a specific deletion, whereas weak local microhomology favors a specific 1-bp insertion. inDelphi predicts that 5-11% of SpCas9 gRNAs that target human exons and introns are Precision50 gRNAs, which we define as gRNAs predicted to produce a single genotypic outcome in ≥50% of all editing products. The range reflects cell type, as we have observed that outcome precision varies slightly between cell types.
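The Precision50 definition above can be made concrete with a short sketch. It assumes the predicted outcome spectrum for one gRNA is available as a mapping from outcome label to frequency. The function names, outcome labels, and numbers are hypothetical and are not part of inDelphi's interface.

```python
from typing import Mapping


def top_outcome_fraction(outcome_freqs: Mapping[str, float]) -> float:
    """Fraction of all editing products taken up by the most frequent outcome."""
    total = sum(outcome_freqs.values())
    if total <= 0:
        raise ValueError("outcome frequencies must sum to a positive value")
    return max(outcome_freqs.values()) / total


def is_precision50(outcome_freqs: Mapping[str, float], threshold: float = 0.5) -> bool:
    """True if a single genotypic outcome makes up at least `threshold` of products."""
    return top_outcome_fraction(outcome_freqs) >= threshold


# Hypothetical predicted spectra for two gRNAs (made-up labels and numbers).
precise_grna = {"1-bp insertion": 60.0, "3-bp deletion": 25.0, "7-bp deletion": 15.0}
diffuse_grna = {"1-bp insertion": 30.0, "3-bp deletion": 30.0, "7-bp deletion": 40.0}

print(is_precision50(precise_grna))  # top outcome is 60% of products
print(is_precision50(diffuse_grna))  # top outcome is 40% of products
```

Normalizing by the total lets the same check work on raw read counts or on predicted percentages.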
This new concept of precise Cas9 editing outcomes opens new doors in both research and therapeutic applications, which we are actively pursuing. We have followed up on this work to understand the DNA repair pathways that drive distinct repair outcomes, finding that inhibition of DNA-PK increases microhomology deletions and inhibition of ATM kinase increases 1-bp insertions. The ability to manipulate these outcome classes raises new prospects for controlling Cas9 nuclease outcomes in genetic screens and for therapeutic purposes.
Shen MW*, Arbab MA*, Hsu J, Worstell D, Culbertson SJ, Krabbe O, Cassa CA, Liu DR, Gifford DK, Sherwood RI. Predictable and precise template-free editing of pathogenic mutations by CRISPR-Cas9 nuclease. Nature. 2018 Nov;563(7733): 646-651. PMID 30405244.
Bermudez-Cabrera HC, Culbertson S, Barkal S, Holmes B, Shen MW, Zhang S, Gifford DK, Sherwood RI. Small molecule inhibition of ATM kinase increases CRISPR-Cas9 1-bp insertion frequency. Nature Communications. 2021 Aug 25;12(1):5111. doi: 10.1038/s41467-021-25415-8. PMID: 34433825
Our work on transcription factor binding logic started with a basic question that we believed would hold the key to a priori prediction of transcription factor binding: why do transcription factors bind to only a small fraction (~1-10%) of the genomic instances of their binding motifs? Answering this question using the existing state-of-the-art tool, ChIP-seq, was untenable given that every cell type expresses hundreds of transcription factors, making their ChIP-seq profiling technically impractical. Instead, we sought a single experimental approach that could reveal the binding patterns of hundreds of transcription factors at once. In concert with David Gifford’s computational lab, we found such a technique in DNase-seq, in which each transcription factor leaves a signature “profile,” allowing us to train a machine learning algorithm to predict the binding patterns of over 700 transcription factors from a single experimental dataset at equivalent accuracy to ChIP-seq for each transcription factor.
Armed with this unprecedented dataset on where transcription factors bind, we could examine what was preventing transcription factors from binding to the vast majority of their genomic motifs. We found that most genomic DNA is “closed” by tightly wound histones and is inaccessible to transcription factors. A small set of transcription factors (pioneer factors) are capable of opening chromatin, and all other transcription factors depend on open chromatin to bind. We uncovered a subset of transcription factors (settler factors) that bind DNA in every instance that their motif occurs in open chromatin (see figure on right for a model). Put another way, we could for the first time predict a priori where in the genome certain transcription factors (settlers) would bind simply by knowing their binding motifs and which regions of the genome have open chromatin.
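The settler rule described above (a settler binds wherever its motif falls in open chromatin) amounts to an interval-containment check. The following is a minimal sketch of that check, assuming motif instances and open-chromatin regions are given as half-open coordinate intervals on one chromosome. The coordinates and function name are invented for illustration, and this is not the lab's analysis code.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on a single chromosome


def predict_settler_sites(
    motif_sites: List[Interval], open_regions: List[Interval]
) -> List[Interval]:
    """Return the motif instances that lie entirely inside an open-chromatin region."""
    predicted = []
    for m_start, m_end in motif_sites:
        if any(o_start <= m_start and m_end <= o_end for o_start, o_end in open_regions):
            predicted.append((m_start, m_end))
    return predicted


# Made-up coordinates: three motif instances, two open regions.
motifs = [(100, 110), (500, 510), (995, 1005)]
open_chromatin = [(90, 200), (900, 1000)]

print(predict_settler_sites(motifs, open_chromatin))
```

Here a motif that only partly overlaps an open region is treated as unbound. That is a modeling choice of this sketch rather than something stated above.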
This work highlights that chromatin accessibility is a major determinant of transcription factor binding; however, these same data indicated that pioneer factors, in spite of having the ability to open previously closed chromatin, still only bind to a fraction of their genomic motifs. So, what is limiting binding of the factors that induce chromatin accessibility? In concert with David Gifford’s lab, we addressed this issue by training a machine learning algorithm, the Synergistic Chromatin Model (SCM), to predict genome-wide chromatin accessibility from DNA sequence alone. Excitingly, the resulting algorithm predicts chromatin accessibility with remarkable accuracy down to the base-pair level.
Yet can this algorithm predict accessibility of novel DNA sequences in a controlled genomic context? To this end, we invented a technique, Single Locus Oligonucleotide Transfer (SLOT), in which a library of thousands of rationally designed DNA phrases is inserted into any genomic locus of interest. SCM accurately predicts the accessibility level of thousands of DNA phrases inserted into several fixed genomic loci using the SLOT assay (see figure on left), providing elegant proof that we have identified a DNA-encoded logic that causes chromatin accessibility. The algorithm’s accuracy is derived from its modeling of chromatin accessibility as a non-specific synergy of chromatin opening by pioneer factors. When pioneer factors bind near other pioneers, they synergize to establish chromatin accessibility, whereas alone they are too weak to have a noticeable effect. Not only does this work propose a coherent hierarchical logic for how transcription factors bind, with cohorts of pioneer factors establishing open chromatin to be occupied by settler factors, it also explains how and why pioneer factors can open chromatin at only a fraction of their genomic motifs.
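To give a feel for how individually weak pioneer sites could combine into a clear effect, here is a toy calculation. It is not the published SCM: it simply passes the summed strength of nearby pioneer sites through a logistic function. The functional form, the bias value, and the strengths are all assumptions chosen for illustration.

```python
import math
from typing import Sequence


def toy_accessibility(pioneer_strengths: Sequence[float], bias: float = -4.0) -> float:
    """Toy accessibility score in (0, 1): logistic of the summed pioneer strengths.

    Assumption: contributions of nearby pioneer sites add up regardless of which
    factor binds, and a negative bias keeps a lone site close to 'closed'.
    """
    return 1.0 / (1.0 + math.exp(-(bias + sum(pioneer_strengths))))


single = toy_accessibility([2.0])            # one pioneer site on its own
pair = toy_accessibility([2.0, 2.0])         # two neighboring pioneer sites
triple = toy_accessibility([2.0, 2.0, 2.0])  # three neighboring pioneer sites

print(f"single={single:.2f} pair={pair:.2f} triple={triple:.2f}")
```

Under these made-up parameters a lone site stays near the closed end of the scale while clusters move toward open, which is the qualitative behavior described above.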
Our ongoing research into deciphering transcription factor binding logic revolves around predicting the dynamic transcription factor binding that enables cell state change, as this is the key to our ultimate goal of enabling computationally driven cellular reprogramming. Of particular interest is how key signaling pathways such as Wnt enact cell type-specific responses.
Sherwood RI*, Hashimoto T*, O’Donnell CW*, Lewis S, Barkal AA, van Hoff JP, Karun V, Jaakkola T, Gifford DK. Discovery of directional and nondirectional pioneer transcription factors by modeling DNase profile magnitude and shape. Nature Biotechnology. 2014 Feb; 32(2): 171-8.
Barkal AA, Srinivasan S, Hashimoto T, Gifford DK, Sherwood RI. Cas9 Functionally Opens Chromatin. PLOS ONE. 2016; 11(3): e0152683. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4816323/
Hashimoto T*, Sherwood RI*, Kang DD*, Rajagopal N, Barkal AA, Zeng H, Emons BJM, Srinivasan S, Jaakkola T, Gifford DK. A Synergistic DNA Logic Predicts Genome-wide Chromatin Accessibility. Genome Res. Epub 2016 Jul 25.
Kang D*, Sherwood RI*, Barkal A, Hashimoto T, Engstrom L, Gifford D. DNase-capture reveals differential transcription factor binding modalities. PLoS One. 2017 Dec 28;12(12):e0187046.
Szczesnik T, Chu L, Ho JWK, Sherwood RI. A High-Throughput Genome-Integrated Assay Reveals Spatial Dependencies Governing Tcf7l2 binding. Cell Systems. 2020 Sept 23;11(3):315-327.e5. PMID: 32910904.
Hammelman J, Krismer K, Banerjee B, Gifford DK, Sherwood RI. Identification of determinants of differential chromatin accessibility through a massively parallel genome-integrated reporter assay. Genome Research. 2020 Sep 24. PMID: 32973041.
Less than 2% of the human genome codes for protein, and thus most somatic variation occurs in non-coding regions of the genome. It is fast becoming routine to perform whole genome sequencing on patients with genetic disease or cancer, yet our ability to harness genomic data to predict the impact of non-coding genotype on gene expression is rudimentary. We have developed a new technology, the Multiplexed Editing Regulatory Assay (MERA), that experimentally links non-coding genotype with gene expression at high throughput. For a given gene, a library of CRISPR guide RNAs (gRNAs) is constructed to target up to 300 kb of non-coding genomic space surrounding that gene. The gene of interest is labeled with GFP in embryonic stem cells (ESCs), and the gRNA library is added such that each cell has a single focal mutation (on average affecting ~10 bp) in the surrounding genomic space. By detecting gRNAs enriched in GFP-negative cells, we identify which regions in the 300 kb of assayed non-coding genomic space are required for expression of the target gene.
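The enrichment step can be sketched as a comparison of each gRNA's share of reads in the GFP-negative sort against its share in the unsorted population. The following is a minimal sketch under that assumption, using a pseudocount-stabilized log2 ratio. The gRNA names and read counts are invented, and the scoring used in the published analysis may differ.

```python
import math
from typing import Dict, Mapping


def log2_enrichment(
    gfp_neg_counts: Mapping[str, int],
    bulk_counts: Mapping[str, int],
    pseudocount: float = 1.0,
) -> Dict[str, float]:
    """log2 ratio of each gRNA's read fraction in GFP-negative vs. bulk cells."""
    n_guides = len(bulk_counts)
    neg_total = sum(gfp_neg_counts.values()) + pseudocount * n_guides
    bulk_total = sum(bulk_counts.values()) + pseudocount * n_guides
    scores = {}
    for guide, bulk in bulk_counts.items():
        neg_frac = (gfp_neg_counts.get(guide, 0) + pseudocount) / neg_total
        bulk_frac = (bulk + pseudocount) / bulk_total
        scores[guide] = math.log2(neg_frac / bulk_frac)
    return scores


# Invented read counts: one guide hitting a required region, two neutral guides.
bulk = {"g_enhancer": 100, "g_neutral_1": 100, "g_neutral_2": 100}
gfp_negative = {"g_enhancer": 80, "g_neutral_1": 10, "g_neutral_2": 10}

scores = log2_enrichment(gfp_negative, bulk)
for guide, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{guide}\t{score:+.2f}")
```

Guides with positive scores are over-represented among GFP-negative cells and so point to regions that may be required for expression. The pseudocount keeps guides with zero reads from producing infinite values.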
To implement MERA, we first constructed GFP fusion proteins to track the expression of the target genes, using a new protocol we developed to perform gene knock-in with reduced effort and increased throughput compared to the standard procedure. Typical gene knock-in requires 2 weeks to clone a locus-specific homology arm construct and clone a gRNA; our technique uses a self-cloning gRNA and a PCR-amplified short homology arm fragment to achieve efficient locus-specific GFP knock-in with a total of 2 hours of preparation time for all knock-in components (see figure below).
We then developed a strategy to perform gRNA screening with maximal efficiency and minimal preparation time. In this strategy, we use cellular homologous recombination to replace a genomically integrated dummy gRNA with one of the gRNAs from the library (see summary figure below).
This homologous recombination strategy eliminates the laborious two-week-long process of lentiviral gRNA library cloning and, because there is only a single genomically integrated dummy gRNA, guarantees that each cell will only receive a single gRNA from the library.
Using MERA on four embryonic stem cell-specific genes, we found that the gRNAs targeting the expected regions such as the GFP sequence, the gene body, promoter, and known enhancers were the most likely to induce loss of GFP expression, proving that MERA effectively detects genomic regions required for gene expression (see figure below). More interestingly, we also found a number of unexpected non-coding regions required for gene expression. These include the promoters of neighboring genes and a set of regions with no known DNase I hypersensitivity or histone modifications associated with active chromatin (see figure below).
Ultimately, we aim to use MERA to develop a predictive computational model of how non-coding genotype impacts gene expression, which would be a boon for interpreting patient whole genome sequencing data. We are extending MERA to dozens of gene loci, aiming to reveal novel gene regulatory paradigms by identifying patterns across the datasets.
Rajagopal N, Srinivasan S, Kooshesh K, Guo Y, Edwards M, Banerjee B, Syed T, Emons BJM, Gifford DK, Sherwood RI. High-throughput mapping of regulatory DNA. Nature Biotechnology. 2016; 34: 167-174. http://www.nature.com/nbt/journal/v34/n2/full/nbt.3468.html
Arbab M, Srinivasan S, Hashimoto T, Geijsen N, Sherwood RI. Cloning-Free CRISPR. Stem Cell Reports. 2015 Nov 10; 5(5):908-917. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4649464/
Lin L, Holmes B, Shen MW, Kammeron D, Geijsen N, Gifford DK, Sherwood RI. Comprehensive mapping of key regulatory networks that drive oncogene expression. Cell Reports. 2020 Nov 24;33(8):108426. PMID: 33238122
Yeo GHT, Lin L, Qi CY, Cha M, Gifford DK*, Sherwood RI*. A Multiplexed Barcodelet Single-Cell RNA-Seq Approach Elucidates Combinatorial Signaling Pathways that Drive ESC Differentiation. Cell Stem Cell. 2020 May 21; S1934-5909(20)30159-4. PMID: 32459995.
Background: Embryonic stem cells can be directed to differentiate into a variety of valuable cell types for disease modeling, drug screening, and regenerative medicine via the precise combinatorial and temporal manipulation of a relatively small set of intercellular signaling pathways. Although many protocols have been developed to direct the differentiation of embryonic stem cells into a variety of cell types using lessons learned from embryonic patterning of those cell types, many cell types of therapeutic value are inaccessible because the appropriate spatiotemporal combination of signaling pathways involved in specifying them has not been elucidated. Current methods for discovering protocols to differentiate cells into a desired state are highly empirical.
Findings: We developed barcodelet single-cell RNA sequencing, which allows us to measure transcriptome-wide expression in 32–384 distinct embryonic stem cell-derived populations per experiment at single-cell resolution. We used this approach to systematically explore the combinatorial effects of activation and inhibition of up to seven signaling pathways during embryonic stem cell germ layer patterning. We showed that the expression of an underappreciated fraction of genes is dependent on the combinatorial activation and inhibition of these signaling pathways, and we developed an analysis framework that identifies treatment combinations associated with distinct expression states from a reference cellular atlas, enabling us to propose and validate specific combinatorial signaling logics that give rise to specific cell populations.
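To show how quickly combinatorial treatment designs grow, the sketch below enumerates every combination of pathway states and gives each one a label. The pathway names, state names, and label format are placeholders, not the conditions or barcodes used in the study.

```python
from itertools import product
from typing import Dict, List, Sequence


def enumerate_conditions(
    pathways: Sequence[str], states: Sequence[str] = ("activate", "inhibit")
) -> List[Dict[str, str]]:
    """All combinations of one state per pathway, as pathway -> state mappings."""
    return [dict(zip(pathways, combo)) for combo in product(states, repeat=len(pathways))]


def assign_labels(conditions: List[Dict[str, str]]) -> Dict[str, Dict[str, str]]:
    """Give each condition a placeholder label (BC001, BC002, ...) for pooling."""
    return {f"BC{i:03d}": cond for i, cond in enumerate(conditions, start=1)}


# Placeholder pathway names; five pathways with two states each.
pathways = ["pathway_A", "pathway_B", "pathway_C", "pathway_D", "pathway_E"]
labeled = assign_labels(enumerate_conditions(pathways))

print(len(labeled))
print(labeled["BC001"])
```

With two states per pathway the number of conditions is 2 to the power of the number of pathways, and adding an "untreated" state makes it a power of 3. This is why a full design over many pathways soon exceeds what one experiment can hold, so a subset of conditions has to be chosen.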
Significance: Our findings support a deep network of combinatorial spatiotemporal relationships between signaling pathways that govern embryo patterning, which sometimes occur at the level of individual enhancers or promoters. Our method to perform transcriptomic analysis on hundreds of systematically chosen conditions per experiment opens up possibilities to explore such relationships, and our analytical frameworks advance our ability to direct stem cell fate toward defined populations identified through single-cell atlas projects.