Widespread recognition of 5′ splice sites by noncanonical base-pairing to U1 snRNA involving bulged nucleotides

  1. Adrian R. Krainer1,4
  1. Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, 11724, USA;
  2. School of Biological Sciences, Division of Molecular Genetics and Cell Biology, Nanyang Technological University, 637551 Singapore;
  3. Isis Pharmaceuticals, Carlsbad, California 92008, USA

    Abstract

    An established paradigm in pre-mRNA splicing is the recognition of the 5′ splice site (5′ss) by canonical base-pairing to the 5′ end of U1 small nuclear RNA (snRNA). We recently reported that a small subset of 5′ss base-pair to U1 in an alternate register that is shifted by 1 nucleotide. Using genetic suppression experiments in human cells, we now demonstrate that many other 5′ss are recognized via noncanonical base-pairing registers involving bulged nucleotides on either the 5′ss or U1 RNA strand, which we term “bulge registers.” By combining experimental evidence with transcriptome-wide free-energy calculations of 5′ss/U1 base-pairing, we estimate that 10,248 5′ss (∼5% of human 5′ss) in 6577 genes use bulge registers. Several of these 5′ss occur in genes with mutations causing genetic diseases and are often associated with alternative splicing. These results call for a redefinition of an essential element for gene expression that incorporates these registers, with important implications for the molecular classification of splicing mutations and for alternative splicing.


    Pre-mRNA splicing is an essential processing step for the expression of ∼90% of protein-coding human genes and relies on conserved sequence elements at both ends of introns, termed splice sites (Sheth et al. 2006; Wahl et al. 2009). These elements are highly diverse, considering that thousands of different sequences act as naturally occurring splice sites in the human transcriptome (Sahashi et al. 2007; Roca and Krainer 2009). The characterization of these sequence elements and the factors that recognize them has been essential for predicting exons in new genes, for the study of alternative splicing (Nilsen and Graveley 2010), and for classifying mutations in these elements that cause human genetic diseases (Buratti et al. 2007).

    Typically, the strength of a splice site (or splice site score) is estimated by algorithms that measure its concordance to matrices built using large collections of splice sites (Senapathy et al. 1990; Brunak et al. 1991; Yeo and Burge 2004; Sahashi et al. 2007; Hartmann et al. 2008). These methods implicitly assume that all of the sequences used to build the matrix are recognized by the same mechanism. However, there are cases in which the splice site score does not reflect the strength of the splice site determined experimentally (Roca and Krainer 2009), highlighting the limitations of these tools. Furthermore, the recognition mechanisms for many splice sites predicted to be weak are poorly understood.

    Splicing of >99% of pre-mRNA introns is catalyzed by the major spliceosome, a dynamic macromolecular machine composed of five small nuclear RNAs (snRNAs) and associated polypeptides, plus many other protein factors (Wahl et al. 2009). The U1 small nuclear ribonucleoprotein particle (snRNP), comprising the U1 snRNA and 10 polypeptides (Pomeranz Krummel et al. 2009), is the main component for early 5′ splice site (5′ss) recognition by the major or U2-type spliceosome. The vast majority of such introns (>99%) belong to the GT-AG (or GU-AG) category, as defined by their intronic terminal dinucleotides (Sheth et al. 2006). For >30 years, it has been firmly established that 5′ss are recognized by base-pairing to the 5′ end of U1 snRNA in a canonical register, defined as +1G at the 5′ss (the first intronic nucleotide) base-pairing to C8 of U1 (the eighth nucleotide of U1) (Fig. 1A, left; Lerner et al. 1980; Rogers and Wall 1980; Zhuang and Weiner 1986; Séraphin et al. 1988; Siliciano and Guthrie 1988). Thus, the 5′ss element spans the last 3 nucleotides (nt) of the exon and the first 8 nt of the intron, establishing a maximum of 11 base pairs (bp) to U1—although the contribution of the seventh and eighth nucleotides in the intron, which are much more variable, appears to depend on the species (Staley and Guthrie 1999; Freund et al. 2005). Later in spliceosome assembly, U1 is replaced by U6 snRNA, which forms a few base pairs to the 5′ss and is likely involved in catalysis (Wassarman and Steitz 1992; Kandels-Lewis and Séraphin 1993; Lesser and Guthrie 1993). In a handful of documented cases, U1 base-pairs at some distance from the 5′ss, and the cleavage site depends on subsequent U6 base-pairing (Cohen et al. 1994; Hwang and Cohen 1996; Brackenridge et al. 2003). There is also an example of a natural human U2-type intron whose splicing appears to be U1 snRNA-independent (Fukumura et al. 2009).

    Figure 1.

    Bulge 5′ss/U1 base-pairing. (A) Diagram of two base-pairing registers between consensus (left) and atypical (right) 5′ss and the 5′ end of U1 snRNA. 5′ss positions are numbered; base-pairing at +7 and +8 can contribute to splicing (Hartmann et al. 2008), although these positions do not show conservation in 5′ss compilations (Sheth et al. 2006). Consensus nucleotides are shown in red in all figures. (Ψ) Pseudouridine; (dot) 2,2,7-trimethylguanosine cap at the 5′ end of U1; (box) upstream exon; (line) intron. Base pairs in the canonical (C) or shifted (S) register are indicated by vertical lines. (B) Diagram of three base-pairing possibilities for the ITPR2 intron 18 5′ss: C (6 bp), S (9 bp), and bulge +2/+3 (B) registers (11 bp). (C) Mutational analysis for the B register. (Top) Schematic of the SMN1/2 minigenes, indicating the test 5′ss replacing the original 5′ss and the mutations introduced. Radioactive RT–PCR analysis is shown for SMN1 (top) and SMN2 (bottom panel) minigenes. The identity of the various spliced mRNAs is indicated on the left; from large to small, the bands correspond to exon 7 inclusion, exon 7 skipping, and exon 7 skipping with activation of a cryptic 5′ss at position −50 in exon 6. In all figures, the mean percentage and SD of exon 7 inclusion are shown below each lane. (D) Transition between shifted and bulge base-pairing. The nucleotides at positions −3 to −1 of the test 5′ss are indicated above each lane. (Lane 1) Atypical 5′ss (shifted register). (Lane 5) Bulge +2/+3 5′ss. Schematics for these 5′ss are in A and B. (Lanes 2–4) Intermediates between the atypical and the bulge +2/+3 5′ss. (E) Suppressor U1 experiments. The schematic shows, for the +5C and +6C mutant 5′ss, the suppressor U1s that restore base-pairing in the canonical (C; G4, or G3, respectively) or bulge (B; G5, or G4, respectively) register. The bulged nucleotide is depicted opposite to a gap in the bottom strand. The 5′ss mutation and suppressor U1s are indicated above each lane. (—) No suppressor.

    Two minor categories of U2-type splice sites have been known for a long time: GC-AG 5′ss (0.9%) and very rare AT-AC 5′ss (only 15 introns in the human genome) (Sheth et al. 2006). These 5′ss conform to consensus motifs very similar to the major U2-type GT-AG 5′ss and are recognized by analogous mechanisms. We recently showed that restoration of base-pairing to both U1 and U6 is essential to rescue recognition of a mutant AT 5′ss that causes aberrant splicing and myotonia (Kubota et al. 2011). U12-type introns are spliced by the minor spliceosome and are very rare as well (0.36%) (Will and Lührmann 2005).

    We recently showed that a small subset of GT-AG 5′ss, which we termed atypical 5′ss, is recognized by a base-pairing register with U1 that is shifted by 1 nt (+1G base-pairs to U1 C9 instead of C8) without changing the actual exon–intron boundary or the sequence of the spliced mRNA (Fig. 1A, right; Roca and Krainer 2009). In budding yeast, mutational analysis led to the suggestion that the noncanonical HOP2 5′ss is recognized by a base-pairing register involving a bulged nucleotide (Leu and Roeder 1999). A bulge in a strand of RNA (or DNA) duplex is defined as a nucleotide (or more) that is not opposed by any nucleotide on the other strand. Here we present extensive experimental evidence for multiple base-pairing registers between human 5′ss and U1, with bulged nucleotides on either RNA strand, and estimate that ∼5% of all 5′ss—present in ∼40% of human genes—use one of these noncanonical registers.

    Results

    A bulge 5′ss/U1 base-pairing register

    Inspection of well-annotated 5′ss in the human transcriptome revealed a group of bona fide 5′ss sequences similar to the atypical 5′ss recognized by shifted base-pairing (Roca and Krainer 2009) but differing at exonic positions −3 to −1 by having consensus nucleotides (Fig. 1B, representative 5′ss sequence from intron 18 of the ITPR2 gene). These 5′ss and U1 can establish 6 or 9 bp in the canonical or shifted registers, respectively. Because this sequence resembles the consensus 5′ss but with a U insertion at position +3, up to 11 bp could be formed by bulging out the U at either position +2 or +3 of the 5′ss. We tested whether this type of 5′ss is actually recognized by a “bulge +2/+3” register. Note that the proposed mechanism does not involve a shift in the sites of trans-esterification chemistry; i.e., the sequence of the spliced mRNA does not change.
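    The base-pair counts quoted above (6, 9, and 11 bp) can be reproduced with a short script. The following is a minimal sketch, not the authors' method: it only counts Watson–Crick and G·U pairs and says nothing about free energy. The 11-nt U1 5′ end is written with the two pseudouridines as U. The test sequence is an illustrative construct (the consensus with a U inserted at +3, as described in the text) and is not claimed to be the exact ITPR2 sequence; all function names are hypothetical.

```python
# U1 snRNA 5' end, positions 1-11 (5'->3'); the pseudouridines at
# positions 5 and 6 are written as U because they pair like uridine.
U1_5P_END = "AUACUUACCUG"

WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}
WOBBLE = {("G", "U"), ("U", "G")}


def count_base_pairs(ss, u1_pos_for_plus1=8, allow_wobble=True):
    """Count 5'ss/U1 base pairs in an ungapped register.

    ss starts at exonic position -3, so index 3 is intronic position +1.
    u1_pos_for_plus1 is the U1 position facing +1G:
    8 for the canonical register, 9 for the shifted register.
    """
    pairs = 0
    for i, nt in enumerate(ss):
        u1_pos = u1_pos_for_plus1 + 3 - i  # the two strands are antiparallel
        if not 1 <= u1_pos <= len(U1_5P_END):
            continue  # no partner within the 11-nt U1 5' end
        pair = (nt, U1_5P_END[u1_pos - 1])
        if pair in WATSON_CRICK or (allow_wobble and pair in WOBBLE):
            pairs += 1
    return pairs


def count_bulge_pairs(ss, bulged_index):
    """Canonical-register count after looping out one 5'ss nucleotide."""
    return count_base_pairs(ss[:bulged_index] + ss[bulged_index + 1:])


consensus = "CAGGUAAGUAU"    # positions -3 to +8
test_site = "CAGGUUAAGUAU"   # illustrative: consensus with a U inserted at +3

print(count_base_pairs(consensus))        # canonical register: 11
print(count_base_pairs(test_site))        # canonical register: 6
print(count_base_pairs(test_site, 9))     # shifted register: 9 (one is G.U)
print(count_bulge_pairs(test_site, 5))    # U at +3 bulged: 11
```

    Bulging the U at +2 (index 4) gives the same count as bulging the U at +3, which is why the register is named "bulge +2/+3".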

    We analyzed the test 5′ss sequences in the context of Survival of motor neuron minigenes (SMN1/2) comprising a human genomic fragment from exon 6 to exon 8. SMN1 and SMN2 paralog pre-mRNAs show different extents of exon 7 inclusion due to a single-nucleotide difference at its sixth position: Exon 7 is completely included in SMN1 and predominantly skipped in SMN2 (Lorson et al. 1999). We replaced the natural 5′ss of this exon in the minigenes by the ITPR2 5′ss to test the bulge +2/+3 hypothesis and introduced point mutations to disrupt different base pairs (Fig. 1C).

    We transiently transfected HeLa cells with the different SMN1/2 minigenes and analyzed the extent of exon 7 inclusion—reflecting recognition of the test 5′ss—by radioactive RT–PCR. The wild-type test 5′ss resulted in nearly complete inclusion of exon 7 in either SMN1/2 context (Fig. 1C, lane 1, top and bottom). This observation indicates that the ITPR2 5′ss is stronger than the SMN1/2 5′ss. Point mutations on either side of the predicted bulged nucleotide partially or completely disrupted exon inclusion in either context, consistent with loss of base pairs to U1 (Fig. 1C, lanes 2–4). Combinations of these mutations exacerbated the disruption of exon 7 inclusion (Fig. 1C, lanes 5,6). The tested mutations disrupted base pairs in different registers, yet the bulge +2/+3 register was the only one severely affected by all of the mutations that resulted in exon skipping (the −1C mutation disrupts a very weak G·U wobble terminal base pair in the shifted register). This observation suggests that the test 5′ss is recognized via this bulge register.

    Next, we gradually converted an atypical 5′ss recognized by shifted base-pairing (from intron 8 of GTF2H1) (Roca and Krainer 2009) to a test 5′ss for the bulge +2/+3 register by introducing mutations at positions −3 to −1. Whereas the atypical and the bulge +2/+3 5′ss resulted in nearly complete SMN2 exon 7 inclusion (Fig. 1D, lanes 1,5, bottom), two of the intermediate mutants showed reduced inclusion (Fig. 1D, lanes 3,4, bottom), suggesting that these 5′ss cannot base-pair efficiently in either register. Furthermore, direct sequencing of the RT–PCR products from the various constructs confirmed that splicing only occurred at the GU boundary and not at the noncanonical UU 1 nt downstream (Supplemental Fig. S1A). This is in contrast to what was reported for a similar mutant 5′ss in FANCC (AUG/UUAAGUAG, where “/” is the mapped exon/intron boundary) (Hartmann et al. 2010), and consistent with a mutation that creates a 5′ss in SMARCAD1 (ACU/GUUAAGUAC), associated with autosomal-dominant adermatoglyphia (lack of fingerprints) (Nousbeck et al. 2011).

    We then used suppressor (or shift) U1 snRNA experiments to determine whether the test 5′ss is recognized via the bulge +2/+3 register (Fig. 1E). This powerful allele-specific suppression strategy has been used to prove base-pairing interactions between many 5′ss sequences and U1 in the canonical (Zhuang and Weiner 1986; Séraphin et al. 1988; Siliciano and Guthrie 1988) or shifted (Roca and Krainer 2009) register. We cotransfected the mutant SMN1/2 minigenes along with U1 snRNA expression plasmids carrying compensatory mutations that restore base-pairing in either the canonical or the bulge +2/+3 register. In SMN2, the loss of exon 7 inclusion due to the +5C mutation (Fig. 1E, lane 3, bottom) was partially rescued by U1 snRNAs with the bulge (B, G5) but not the canonical (C, G4) suppressor mutation (Fig. 1E, lane 5 vs. 4). Similarly, the +6C mutation (Fig. 1E, lane 6) was weakly rescued by the bulge (B, G4) but not the canonical (C, G3) suppressor in SMN1 (Fig. 1E, lane 8 vs. 7). Even though not all suppressor U1s are effective in such experiments, the suppressors in the bulge but not in the canonical register rescued recognition of the test 5′ss, thereby demonstrating that this 5′ss base-pairs to U1 in the bulge +2/+3 register.

    Registers with bulges at other 5′ss positions

    Next, we addressed the generality of bulge base-pairing; i.e., whether other natural 5′ss can be recognized by bulging out nucleotides at other positions. We found human authentic 5′ss sequences nearly identical to the consensus, but with an insertion at position +4 or at +5 (Fig. 2A). Such 5′ss are predicted to have 6 or 7 bp to U1 in the canonical register, which can be extended to 11 bp if a nucleotide at +4 or +5 is bulged, respectively. Thus, the extra base pairs in the bulge register would provide an energetic advantage over the canonical register. For the experiments below, the 5′ss base-pair poorly to U1 in the shifted register (Supplemental Fig. S1B), so for simplicity, comparisons are only made between the canonical and bulge registers.
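    To illustrate how candidate bulge positions can be found for such sequences, the sketch below loops out each 5′ss nucleotide in turn and reports the position(s) that maximize the number of base pairs in the canonical register. It is a simplification under stated assumptions: it counts base pairs rather than computing free energies, it considers only single-nucleotide bulges on the 5′ss strand, and both example sequences are hypothetical constructs (the consensus with one inserted nucleotide), not sequences taken from the paper.

```python
U1_5P_END = "AUACUUACCUG"  # positions 1-11, pseudouridines written as U
PAIRS = {"AU", "UA", "GC", "CG", "GU", "UG"}  # Watson-Crick plus G.U wobble


def canonical_pairs(ss):
    """Base pairs in the canonical register; ss starts at position -3."""
    # 5'ss index i faces U1 position 11 - i, i.e. the reversed U1 string.
    return sum(a + b in PAIRS for a, b in zip(ss, reversed(U1_5P_END)))


def position_label(index):
    """Convert a 0-based index (0 is -3) to 5'ss numbering, which has no 0."""
    return str(index - 3) if index < 3 else f"+{index - 2}"


def best_single_bulges(ss):
    """Loop out each nucleotide in turn; return the top count and positions."""
    scores = {i: canonical_pairs(ss[:i] + ss[i + 1:]) for i in range(len(ss))}
    top = max(scores.values())
    return top, [position_label(i) for i, s in scores.items() if s == top]


plus4_insert = "CAGGUACAGUAU"  # hypothetical: consensus with a C inserted at +4
print(canonical_pairs(plus4_insert))     # 6 bp without a bulge
print(best_single_bulges(plus4_insert))  # (11, ['+4'])

plus3_insert = "CAGGUUAAGUAU"  # hypothetical: consensus with a U inserted at +3
print(best_single_bulges(plus3_insert))  # (11, ['+2', '+3'])
```

    When two adjacent nucleotides are identical, either one can be bulged with the same result, which matches the "+2/+3" style of register names used in the text.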

    Figure 2.

    Base-pairing registers with bulges at positions +4 or +5. (A) Diagrams of the bulge +4 and bulge +5 base-pairing registers, as in Figure 1, A and B. The bulged nucleotide is numbered. (B) Mutational analyses in SMN1 (top panels) and SMN2 (bottom panels) minigene contexts. In B and D, RT–PCR products, mean exon inclusion, and SD are indicated as in Figure 1C. The additional bands are described in Supplemental Figure S2D. (C) Schematic of canonical (C; G3G10) and bulge (B; G4G10) suppressor U1s for the −2C+6C mutation in the bulge +4 register. In these experiments, suppressor U1s carry compensatory mutations for both 5′ss mutations. Other suppressors are shown in Supplemental Figure S1C. (D) Suppressor U1 snRNA experiments. The mutant test 5′ss and suppressor U1 are indicated at the top. Suppressor U1s in the canonical register were used in lanes 4, 8, 13, and 17. Suppressors in the bulge registers were used in lanes 5, 9, 14, and 18.

    We tested the bulge +4 and bulge +5 hypotheses in the SMN1/2 minigenes by mutational analysis and suppressor U1 experiments. The wild-type bulge +4 (from intron 6 of MORC4) and bulge +5 (from intron 8 of PARD3) 5′ss resulted in complete exon 7 inclusion with both minigenes (Fig. 2B, lanes 1,7). Among the mutations that disrupt base pairs on either side of the bulge, only the −1C mutants reduced exon 7 inclusion (Fig. 2B, lanes 2–4,8–10), indicating that these test 5′ss are quite strong. The +6C mutation, which disrupts a base pair only in the bulge registers, had no effect by itself (Fig. 2B, lanes 4,10). Nevertheless, +6C cooperated with −2C to reduce exon 7 inclusion and exacerbated the extent of skipping by the −1C mutation (Fig. 2B, lanes 6,11,12). The −2C mutation introduces a base pair in the shifted register, yet it is disruptive in combination with +6C, consistent with these 5′ss not being recognized by shifted base-pairing. The results with the combined mutations suggest that the +6 nucleotide base-pairs to U1, consistent with the bulge +4 or bulge +5 registers.

    Suppressor U1 experiments provided additional evidence for the bulge +4 or bulge +5 registers. The loss of exon 7 inclusion upon the −2C+6C mutation in the bulge +4 5′ss was rescued by the suppressor in the bulge but not in the canonical register (Fig. 2C,D [lane 9 vs. 7,8]). The rescue of the bulge +5 5′ss with the −2C+6C mutation was higher with the bulge than with the canonical suppressor in SMN2 (Fig. 2D, lane 18 vs. 16,17). We conclude that the test 5′ss are preferentially if not exclusively recognized by the bulge +4 or bulge +5 registers, respectively.

    We also tested a more complex case involving a 5′ss sequence predicted to base-pair to U1 by bulging out an adenosine at either position +3, +4, or +5 (from intron 2 of JMJD6). Mutational analyses and suppressor U1 experiments demonstrated that such 5′ss are indeed preferentially recognized by the bulge +3/+4/+5 register (Supplemental Fig. S2).

    Registers with bulges at the 5′ end of U1

    Our data show that the 5′ss/U1 helix can tolerate bulges on the 5′ss strand to support productive splicing. Other 5′ss (example from intron 5 of PRKD1) can form 7 bp to U1 in the canonical register, which can be increased to 10 bp by bulging out one of the two consecutive pseudouridines (Ψ, a uridine isomer) at positions 5 and 6 of the 5′ end of U1 (Fig. 3A). We tested the bulge Ψ hypothesis by mutational and suppressor U1 analyses. The wild-type test 5′ss supported efficient SMN1/2 exon 7 inclusion (Fig. 3B, lane 1), which was partially or totally disrupted by point mutations on either side of the bulge (−1C or +5C), independently or combined (Fig. 3B, lanes 2,3,6). The suppressor U1s in the bulge but not the canonical register rescued exon inclusion disrupted by the +5C and −1C+5C mutations (Fig. 3B, lane 4 vs. 3,5 in both SMN1 and SMN2, and to a lesser extent, lane 9 vs. 6,10 in SMN1), demonstrating that these 5′ss base-pair to U1 in the bulge Ψ register.

    Figure 3.

    The bulge Ψ base-pairing register. (A) Schematic for the bulge Ψ register and for suppressor U1s for the +5C mutation (other suppressors are shown in Supplemental Fig. S1D). The U1 G4 and U1 G3 suppressors restore base-pairing in the canonical (C) or bulge (B) register, respectively. (B) Mutational analyses and suppressor U1 experiments demonstrate the bulge Ψ register. For the −1C+5C mutation, the G3G9 and G4G9 suppressors restore 2 bp in the bulge and canonical registers, respectively. RT–PCR products, mean exon inclusion, and SD are indicated as in Figure 1C.

    Finally, we used U1-specific RNA decoys (Roca and Krainer 2009) to confirm that all of the tested 5′ss in the SMN1/2 contexts are indeed recognized by U1 and not by other U1-like snRNAs (Kyriakopoulou et al. 2006). These vector-driven RNA decoys sequester endogenous U1, resulting in the loss of 5′ss recognition and, consequently, in SMN1/2 exon 7 skipping. We detected exon 7 skipping upon cotransfection of the U1 decoy but not a control decoy with a mismatch, confirming that the recognition of the test 5′ss is mediated by U1 snRNA (Supplemental Fig. S3).

    Bulge registers in their natural context

    The above results demonstrate that bulge 5′ss/U1 base-pairing can occur in the context of mutant SMN1/2 minigenes; next, we tested whether these registers are used in a natural context. We selected five representative examples from our genomic searches (see below) and constructed three-exon/two-intron minigenes with the test 5′ss as part of the middle exon. The registers tested included bulge +3 (F5 minigene), bulge +4 (RPGR minigene), bulge +5 (MLH1 minigene), bulge +3/+4/+5 (RB1 minigene), and bulge Ψ (MDM2 minigene).

    We then performed mutational analyses and suppressor U1 experiments (Fig. 4; Supplemental Figs. S4, S5). 5′ss point mutation alleles of MLH1 and RPGR—associated with colorectal cancer and retinitis pigmentosa, respectively—resulted in reduced exon inclusion compared with wild-type minigenes (Fig. 4A [lane 2 vs. 1], B [lanes 2,3 vs. 1]). Mutations that disrupted the bulge also resulted in skipping of the middle exon (Fig. 4A [lanes 5,6], B [lanes 4,7,8,11]). Suppressor U1s in the bulge but not the canonical register rescued correct splicing (Fig. 4A [lane 8 vs. 6,7], B [lane 10 vs. 8,9]). Analogous experiments with the F5, RB1, and MDM2 minigenes gave similar results (Supplemental Fig. S4). We conclude that the five representative 5′ss are recognized via the predicted bulge base-pairing registers.

    Figure 4.

    Bulge base-pairing registers in their natural context. (A) Bulge +4. (Top) Schematic of the RPGR minigene, indicating the bulge base-pairing register and some of the point mutations introduced. (Bottom) Mutational analysis and suppressor U1 experiments. The mutant 5′ss and suppressor U1 are indicated above each lane. RT–PCR products are indicated on the left and correspond to inclusion (1) and skipping (2) of the middle exon. All point mutations disrupted exon 4 inclusion (lanes 2–6) and the U1 suppressor for the +7C mutation in the bulge (G3) but not the canonical (G2) register rescued inclusion (lanes 6–8). (A,B) Mean exon inclusion values and SD are shown as in Figure 1C, and diagrams for the various U1 suppressors are shown in Supplemental Figure S4A. (B) Bulge +5. Schematic of the MLH1 minigene, with labels as in A. RT–PCR products as in A. All point mutations disrupted exon 10 inclusion (lanes 2–4,7,8), and the +7C mutation was rescued by the bulge (lane 10) and not by the canonical U1 suppressor (lane 9).

    Genomic analyses of bulge 5′ss/U1 base-pairing

    Our experiments demonstrate that 5′ss positions +2 to +5 and the Ψ at U1 position 5 or 6 can be bulged in certain 5′ss/U1 RNA helices. We next addressed the following issues about bulge 5′ss/U1 base-pairing: (1) the number of authentic human 5′ss recognized via these noncanonical registers, (2) whether other bulge registers occur, (3) the energetic advantage of the bulge over the canonical register, (4) the implications for disease-causing mutations and single-nucleotide polymorphisms (SNPs) at 5′ss, and (5) the involvement in alternative splicing.

    We used a bioinformatics approach to address these questions. We generated a data set of 201,541 human authentic (well-annotated) 5′ss sequences, including 15 nt on each side of the exon/intron junction, for both constitutive and alternative exons. For each sequence and the 5′ end of U1, we estimated the base-pairing register and minimum free energy (ΔG1, in kilocalories per mole) using UNAFold hybrid (Markham and Zuker 2008). In a second run of UNAFold, we obtained the free energies for these 5′ss by forcing canonical base-pairing (ΔG2) (see the Materials and Methods). UNAFold predicted a total of 10,248 5′ss (5.1%) that base-paired to U1 using a bulge register, which we termed “bulge 5′ss.” The bulge 5′ss occurred in 6577 genes, amounting to 41% of the total of 15,894 genes in our data set. The energetic advantage of the bulge over the canonical register (ΔΔG) was calculated as the difference between ΔG1 and ΔG2 and ranged from −0.1 to −4.9 kcal/mol. To identify the subset of bulge 5′ss in which the bulge register confers a substantial energetic advantage, we selected those cases with a ΔΔG ≤ −1, based on our SNP analysis (see the Results below and the Materials and Methods).
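    The ΔΔG filter described above amounts to a subtraction and a threshold. The following sketch assumes that each 5′ss already has its two free energies from separate folding runs; it does not call UNAFold, and the record type, site names, and energies are made up for illustration. The last two lines simply re-derive the proportions quoted in the text from the reported counts.

```python
from dataclasses import dataclass


@dataclass
class SpliceSite:
    name: str         # hypothetical identifier
    dg1: float        # unconstrained minimum free energy (kcal/mol)
    dg2: float        # free energy with the canonical register forced (kcal/mol)
    has_bulge: bool   # True if the unconstrained structure contains a bulge

    @property
    def ddg(self) -> float:
        # Energetic advantage of the predicted register over the canonical one;
        # rounded to one decimal to avoid floating-point noise at the threshold.
        return round(self.dg1 - self.dg2, 1)


def substantial_bulge_sites(sites, threshold=-1.0):
    """Names of bulge 5'ss whose ddG is at or below the threshold."""
    return [s.name for s in sites if s.has_bulge and s.ddg <= threshold]


# Made-up energies, for illustration only.
sites = [
    SpliceSite("site_a", -8.3, -5.9, True),    # ddG = -2.4
    SpliceSite("site_b", -6.1, -6.0, True),    # ddG = -0.1
    SpliceSite("site_c", -7.9, -7.9, False),   # canonical, ddG = 0.0
]
print(substantial_bulge_sites(sites))          # ['site_a']

# Cross-check of the proportions reported in the text.
print(f"{100 * 10248 / 201541:.1f}% of 5'ss")   # 5.1% of 5'ss
print(f"{100 * 6577 / 15894:.0f}% of genes")    # 41% of genes
```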

    The bulge 5′ss set contains 6940 5′ss (3.4% of all 5′ss) using a base-pairing register with one bulged nucleotide (Table 1). This includes the registers for which we obtained the experimental evidence above and new ones, including a bulge at position −1 (see the Discussion), a bulge at either position +3/+4 or +4/+5, and a bulge at GC 5′ss, including the C at position +2.

    Table 1.

    Numbers and distribution of bulge 5′ss

    In addition to single-nucleotide bulges, UNAFold also predicted many registers involving longer bulges at the 5′ss, ranging from 2 to 8 nt (Table 1). These registers have not been experimentally tested yet, but they would account for the recognition of 3294 5′ss (1.6%). The number of candidates and the ΔΔG became smaller as the bulge length increased (Supplemental Fig. S6). Finally, rare registers included a bulge of both pseudouridines in U1 (only conferring a marginal ΔΔG of −0.1), and two bulges at distant 5′ss positions.

    By searching the splice site database SpliceRack (Sheth et al. 2006), examples of bona fide 5′ss recognized by our experimentally validated bulge registers were also found in Mus musculus, Drosophila melanogaster, Caenorhabditis elegans, and Arabidopsis thaliana (Supplemental Table S1). This conservation suggests that bulge 5′ss/U1 base-pairing is a general and phylogenetically widespread phenomenon.

    To summarize our predictions, from the 10,248 bulge 5′ss (Table 1), a total of 3016 5′ss (1.5% of all 5′ss) use a single-nucleotide bulge register consistent with the experimental evidence in our model minigenes. Furthermore, 5877 5′ss (2.9%) have a bulge register that confers a substantial energetic advantage (ΔΔG ≤ −1) over the canonical register. We conclude that bulge 5′ss/U1 base-pairing is a frequent phenomenon, affecting the recognition of 1.5%–5.1% of all human 5′ss.

    Tm measurements support bulge base-pairing

    We carried out oligonucleotide duplex melting experiments to further test the formation of bulge 5′ss/U1 helices and confirm the reliability of the UNAFold predictions. For the 5′ss strand, we designed a set of 45 oligoribonucleotides, which included sequences for calibration, and 18 test pairs for several bulge base-pairing registers (Table 2). The calibration oligonucleotides included two consensus 5′ss (consensus and consensus-long, with 9 and 11 bp to U1, respectively), five related sequences with mutations (consensus mut), and two sequences with very poor match to the consensus (low). The test pairs included a “bulge” 5′ss sequence (B) predicted to base-pair to U1 with bulged nucleotides, and a control sequence (C) with a single-nucleotide difference predicted to abolish the bulge (Supplemental Fig. S7). For the complementary U1 strand, we synthesized two U1 oligoribonucleotides: one with an unmodified 11-nt 5′ end, and the other with the two Ψs plus the 2′-O-methyl nucleotides at the first two positions, characteristic of U1 snRNA.

    Table 2.

    Summary of oligonucleotide melting experiments

    For each oligonucleotide duplex (5′ss and U1), we obtained melting curves and experimentally derived Tm and ΔG (ΔGexp) values (Table 2; see the Materials and Methods). Overall, the ΔGexp correlated very well with the predicted ΔG1 (R = 0.92 for the modified U1 oligonucleotide). The Tm values for the helices with the modified U1 oligonucleotide were, on average, 3.4°C (±1.6°C) higher than those for the unmodified one, which is in the range of that previously described for a consensus 5′ss (2°C) (Hall and McLaughlin 1991). Analysis of a partially modified oligonucleotide with only the two Ψs indicated that ribose methylation did not contribute to the enhanced stability (data not shown). Many 5′ss oligonucleotides did not show a cooperative hyperchromic shift with increasing temperature, suggesting that these RNAs did not base-pair to U1 under these experimental conditions (Table 2; Supplemental Fig. S7).

    Interestingly, the seven pairs of 5′ss RNAs that showed cooperative transitions had a consistent trend: The B oligonucleotide had a higher Tm and more stable ΔGexp than the C oligonucleotide, such as for the bulge +2/+3 pair (Table 2; Supplemental Fig. S7). From these oligonucleotide pairs and modified U1, we derived a mean difference in ΔGexp of −1.2 kcal/mol (±0.8). In other cases, the B but not the C oligonucleotide showed a cooperative transition and ΔGexp, such as for the bulge +5 pair. This trend was seen in 11 and eight pairs with the unmodified and modified U1 oligonucleotides, respectively. This finding also indicates that the bulge oligonucleotides base-paired more stably to U1 than the controls.

    In summary, the 15 informative pairs showed that the bulge oligonucleotide bound to the modified U1 oligonucleotide with higher affinity than the control (Table 2). These thermodynamic data strongly support the notion that bulged nucleotides can occur in the context of 5′ss/U1 helices.

    Implications for disease mutations and SNPs

    We next asked whether the bulge 5′ss have previously documented mutations causing human genetic diseases or have SNPs. From a set of 581 disease-causing mutations at 5′ss (Roca et al. 2008), we found that 24 (4.1%) mapped to bulge 5′ss (Supplemental Table S2). As expected, the ΔG1 values for the mutant 5′ss are substantially smaller than those for the corresponding wild-type 5′ss (mean difference of −1.7 kcal/mol). However, the mutation disrupted the bulge structure in only one case. Although a larger data set will be needed to establish statistical significance, these observations suggest that deleterious mutations at bulge 5′ss tend to disrupt a base pair not involved in the bulge.

    From a set of 1,116 SNPs at human 5′ss (Roca et al. 2008), 57 (5.1%) mapped to bulge 5′ss (Supplemental Table S3). The ΔG1 values for either 5′ss variant are similar (mean absolute difference [MAD] = −1.0 kcal/mol), consistent with a generally neutral effect of the SNP on 5′ss strength. In 34 cases, the bulge structure is maintained by the SNP, resulting in small differences in ΔG1 between the two 5′ss variants (MAD = −0.6 kcal/mol). In the remaining 23 cases, one allele base-pairs to U1 with bulged nucleotides, and the other allele base-pairs in a canonical register, resulting in a larger ΔG1 difference (MAD = −1.7 kcal/mol). In eight out of 23 cases, the 5′ss variant with the bulge structure had a less stable ΔG1 than the variant using the canonical register because the SNP introduces an extra base pair in the canonical register. For these eight SNPs, the difference in ΔG1 between the two 5′ss variants would be −3.5 kcal/mol larger, on average, if the weaker allele did not base-pair in a bulge register, suggesting a compensatory effect of the bulge structure. These findings suggest that SNPs at bulge 5′ss have a low overall impact on 5′ss strength, even if the bulge structure is not conserved.

    Involvement in alternative splicing

    Finally, we investigated whether the bulge base-pairing registers tend to occur preferentially in 5′ss involved in alternative splicing. To this end, we subdivided the set of 5′ss into four categories: constitutive, alternative 5′ss (choice between at least two tandem 5′ss), alternative 3′ss (5′ss in exons with tandem 3′ss), and cassette exons (exons that are included or skipped) (Table 3). Compared with the canonical 5′ss, the bulge 5′ss were significantly enriched in the alternative 5′ss set (4.17% vs. 2.10%, P < 10^−28, Fisher's exact test) and cassette exon set (8.15% vs. 6.98%, P < 10^−5). These differences remained when we used the bulge 5′ss with a ΔΔG ≤ −1 (4.23% vs. 2.10% for alternative 5′ss, P < 10^−18).
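    For readers who want to see how an enrichment of this kind is tested, the sketch below implements a two-sided Fisher's exact test on a 2×2 table using only the standard library. Several things here are assumptions: the text does not state whether the test was one- or two-sided, the underlying counts are in Table 3 and are not reproduced here, and the counts in the example are invented purely to show the call. Log-gamma arithmetic is used so that the function also works for tables with hundreds of thousands of entries.

```python
from math import exp, lgamma


def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Rows: bulge vs. canonical 5'ss; columns: in the category vs. not.
    """
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def log_pmf(x):
        # Hypergeometric probability of x in the top-left cell.
        return log_comb(row1, x) + log_comb(row2, col1 - x) - log_comb(n, col1)

    observed = log_pmf(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    p = sum(exp(log_pmf(x)) for x in range(lo, hi + 1)
            if log_pmf(x) <= observed + 1e-9)
    return min(p, 1.0)


# Invented counts: 42 of 1000 bulge 5'ss vs. 21 of 1000 canonical 5'ss
# fall in the category of interest.
p_value = fisher_exact_two_sided(42, 958, 21, 979)
print(f"P = {p_value:.3g}")
```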

    Table 3.

    Bulge base-pairing and alternative splicing

    Splice sites involved in alternative splicing are slightly weaker, on average, than constitutive splice sites (Itoh et al. 2004; Wang et al. 2005), and we confirmed this trend in our data set of canonical 5′ss (ΔG1 for 5′ss in constitutive and alternative splicing are −7.88 and −7.53 kcal/mol, respectively, P < 10^−5, one-sample t-test). As the mean ΔG2 for the bulge 5′ss is much less stable than that for the canonical 5′ss (−5.76 vs. −7.88 kcal/mol) (Table 3), the enrichment of the bulge 5′ss in alternative splicing events could be biased by the overall weakness of these 5′ss. Thus, we derived a subset of “weak canonical 5′ss” (see the Materials and Methods) with a mean ΔG2 almost identical to that for the bulge 5′ss (−5.81 vs. −5.76). Comparison of the bulge 5′ss with the weak canonical 5′ss revealed that the enrichment for cassette exons essentially disappeared (8.15 vs. 7.90, P < 0.14, Fisher's exact test), but the enrichment for alternative 5′ss remained highly significant (4.17 vs. 2.62, P < 10^−12). This small yet statistically significant bias reveals that the bulge 5′ss, when compared with canonical 5′ss, are more frequently associated with alternative 5′ss events.

    Discussion

    We experimentally demonstrated that many 5′ss are recognized by U1 snRNA via base-pairing registers involving bulged nucleotides. We also estimated that 1.5%–5.1% of all naturally occurring human 5′ss are recognized by these noncanonical registers. Indeed, 6577 genes have at least one bulge 5′ss, which represents 41% of the total number of genes in our data set. These predictions strongly suggest that bulge base-pairing is much more prevalent than shifted base-pairing (only 59 cases) (Roca and Krainer 2009) and that bulge 5′ss are considerably more abundant than noncanonical or GC 5′ss (0.9%) and minor spliceosome (U12-type) 5′ss (0.36%) (Sheth et al. 2006). Bulge 5′ss/U1 base-pairing appears to occur in a wide range of species, likely including one example in budding yeast (Leu and Roeder 1999), which has a small number of U2-type introns with a very highly conserved 5′ss consensus motif and lacks U12-type introns. Furthermore, we estimate that 2.9% of all human 5′ss use a bulge register that confers a substantially lower free energy than the canonical register (ΔΔG ≤ −1). On the other hand, a small ΔΔG indicates that the bulge helix is roughly as stable as the nonbulge helix. Thus, many 5′ss can be recognized by either canonical or bulge base-pairing at relative levels that can be estimated by ΔΔG and a partition function (Huang et al. 2009).
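
    The relationship between ΔΔG and the relative use of the two registers can be illustrated with a two-state Boltzmann model. The following is a minimal sketch, not the partition-function treatment of Huang et al. (2009): it assumes that only the canonical and one bulge register compete, that ΔΔG is the bulge free energy minus the canonical free energy, and that the temperature is 37°C. The function name `bulge_fraction` is hypothetical.

```python
import math

R_KCAL = 1.987e-3  # gas constant in kcal/(mol*K)


def bulge_fraction(ddg_kcal: float, temp_c: float = 37.0) -> float:
    """Fraction of duplexes in the bulge register under a two-state model.

    ddg_kcal is ddG = dG(bulge) - dG(canonical); negative values favor the bulge.
    Assumption: no other registers or competing structures are populated.
    """
    rt = R_KCAL * (temp_c + 273.15)
    # p_bulge = exp(-Gb/RT) / (exp(-Gb/RT) + exp(-Gc/RT)) = 1 / (1 + exp(ddG/RT))
    return 1.0 / (1.0 + math.exp(ddg_kcal / rt))


for ddg in (0.0, -0.5, -1.0, -2.0):
    print(f"ddG = {ddg:+.1f} kcal/mol -> bulge fraction = {bulge_fraction(ddg):.2f}")
```

    Under these assumptions, a ΔΔG of 0 gives equal occupancy of both registers, whereas a ΔΔG of −1 kcal/mol places a little over 80% of the duplexes in the bulge register.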

    The formation of 5′ss/U1 helices with single-nucleotide bulges is strongly supported by experimental and computational methods (Table 1). The only exception is the bulge −1 register (2913 candidates), which did not pass the mutational analysis test in SMN1/2 (data not shown). Nevertheless, 1744 candidate 5′ss for the bulge −1 register have a ΔΔG ≤ −1, and the Tm data for bulge −1 were consistent with bulge base-pairing (Table 2). Furthermore, even if the bulge at position −1 does not form in SMN1/2, we cannot exclude the possibility that it does so in its natural contexts. For these reasons, we kept this register in our predictions.

    For single-nucleotide bulges, all of the bulged positions at the 5′ss (+2 to +5, and perhaps −1) or at the 5′ end of U1 (only the Ψ at 5 or 6) are limited to the middle of the helix. This observation reflects an energetic requirement for the bulged nucleotide to be flanked by a sufficient number of base pairs. Furthermore, most of the 5′ss positions that are bulged out in the helix are clustered opposite or close to the two Ψs of U1. Ψs in RNA helices establish an additional water bridge with the phosphate backbone and stabilize base-stacking in general (Arnez and Steitz 1994; Davis 1995) and specifically in the context of consensus 5′ss/U1 helices (Hall and McLaughlin 1991). Our Tm data also showed that Ψs strengthen 5′ss/U1 helices, but we could not detect an additional role in bulge structures (Table 2). The Ψ in the U2 snRNA/branch point sequence helix stabilizes base-stacking around the bulged adenosine in the pre-mRNA in addition to placing the bulge in an extrahelical conformation (Lin and Kielkopf 2008). Thus, it is possible that these modified nucleotides contribute to the stability of the 5′ss/U1 bulge structure.

    Our predictions also include registers with longer bulges, ranging from 2 to 8 nt. These registers usually confer a smaller energetic advantage over the canonical register (Supplemental Fig. S6), likely reflecting distortions of the RNA helix. These distortions include a kink, with the bending angle increasing with the length of the bulge (Bhattacharyya et al. 1990; Gohlke et al. 1994), and the loss of base-stacking interactions with bulges longer than 1 nt (Znosko et al. 2002). In any case, a total of 1496 such cases have a ΔΔG ≤ −1, arguing that bulges longer than 1 nt may occur in certain 5′ss/U1 helices.

    We found that disease-causing mutations do not affect the bulge structure, but SNPs usually do, yet the overall 5′ss strength is roughly conserved. This suggests that the disruption of base pairs introduced by the bulge register has a less severe impact on 5′ss strength than the disruption of base pairs in common with the canonical register. Furthermore, the bulge 5′ss are proportionately more often involved in alternative splicing than the canonical 5′ss. The selection of alternative 5′ss is influenced by their relative strengths; other cis-elements, such as exonic or intronic splicing enhancers or silencers; and trans-acting protein factors (Eperon et al. 2000; Yu et al. 2008). Our data suggest that, in general, the additional base pairs enabled by bulging result in a subtle increase in 5′ss strength, which might be important to fine-tune alternative splicing.

    An important implication of bulge base-pairing is that the length of the 5′ss motif increases with the length of the bulge, such that some 5′ss span >11 nt. Most of the current 5′ss scoring methods only consider 9 nt or, in some cases, 11 nt (Senapathy et al. 1990; Yeo and Burge 2004; Sahashi et al. 2007; Hartmann et al. 2008), thereby omitting important information that can contribute to 5′ss strength. Furthermore, the longer the 5′ss motif, the more likely it is that U1 base-pairing can compete with proteins binding at overlapping sites and/or with pre-mRNA secondary structures (Warf and Berglund 2010). Such competition scenarios could limit or regulate the formation of bulge structures in cells. Finally, it is also possible that the 5′ss/U1 bulge structures are positively or negatively influenced by RNA-binding proteins bound at nonoverlapping exonic/intronic splicing enhancers/silencers.

    In addition to U1 snRNA, there are proteins that bind the 5′ss and influence splicing, such as U1C (a component of U1 snRNP), which can bind the 5′ss even in the absence of the U1 5′ end (Du and Rosbash 2002). We found that bulge 5′ss/U1 base-pairing resulted in productive splicing, suggesting that binding of these proteins—if it is indeed necessary—can tolerate the distortion of the helix induced by the bulge. The crystal structure of the U1 snRNP (Pomeranz Krummel et al. 2009) revealed that U1C contacts the minor groove of the base pairs established by U1 nucleotides C8 and A7, which are maintained in almost all of the bulge registers we describe (except GC 5′ss). The diversity of footprint patterns for the U1 snRNP onto 5′ss sequences (Yu et al. 2008) is also consistent with a dynamic and flexible interaction that can presumably accommodate small distortions in the RNA double helix, such as a bulge. Furthermore, U6 snRNA could potentially base-pair to our test 5′ss in bulge registers as well. U6 and the human consensus 5′ss can only form five Watson-Crick base pairs, suggesting modest energetic requirements for the 5′ss/U6 helix. Since this helix is at the catalytic core of the spliceosome (Rhode et al. 2006), changes in the U6 base-pairing register could affect the selection of the correct trans-esterification site during the first step of splicing. It is also possible that U6 binds to the bulge 5′ss in a canonical register, as suggested for the atypical 5′ss that are recognized by U1 in a shifted register (Roca and Krainer 2009).

    The multiple 5′ss/U1 base-pairing registers help explain the efficient recognition of many authentic 5′ss otherwise predicted to be weak. In turn, these additional registers increase the number of pseudo-5′ss in exons and especially introns. Pseudo-5′ss are sequences that conform to the 5′ss motif but are not used for splicing (Sun and Chasin 2000), although at least some of them can have functional roles. First, large introns might be spliced in a stepwise manner using internal splice sites, as shown in flies (Hatton et al. 1998) and, in one case, humans (Parra et al. 2008). Second, U1 binding at intronic sites can repress the inclusion of pseudoexons—intronic fragments resembling exons—in the mRNA (Buratti and Baralle 2010). Third, the recently discovered role of U1 in preventing premature polyadenylation at intronic cryptic poly(A) sites (Kaida et al. 2010; Vorlová et al. 2011) might further increase the need for the recognition of intronic pseudo-5′ss by U1. As a corollary, the diversity of 5′ss/U1 registers presumably increases the importance of cis-elements and trans-factors other than U1 for proper discrimination between authentic and pseudo-5′ss.

    We showed that a substantial fraction of 5′ss are recognized by U1 using noncanonical base-pairing registers involving bulged nucleotides and that the use of these registers increases 5′ss strength. These findings further highlight the flexibility of the interaction between the 5′ss and the 5′ end of U1, allowing for many base-pairing registers to support efficient splicing. Importantly, these registers are not considered by any of the current splice site scoring methods. Thus, the characterization of these registers should allow the development of more accurate—albeit more complex—prediction tools, with important implications for the molecular classification of splicing mutations and SNPs and for the study of alternative splicing.

    Materials and methods

    Minigene cloning

    We used the U1 expression plasmids and decoys as well as the SMN1/2 minigenes in the pCI vector as described (Roca and Krainer 2009). We amplified the MLH1, RPGR, F5, MDM2, and RB1 fragments from human genomic DNA and subcloned them into the pcDNA3.1+ vector (Invitrogen). We internally deleted introns 3 and 4 of RPGR, intron 11 of MDM2, and introns 3 and 4 of RB1 to leave only 225 nt at each end. Likewise, we deleted introns 9 and 10 of MLH1 and introns 22 and 23 of F5 to leave 250 nt and 200 nt at each end, respectively. We incorporated the designed mutations into the minigenes by PCR mutagenesis.

    Minigene transfection

    We cultured and transfected HeLa cells in six-well plates using FuGENE 6 (Roche Diagnostics) as described (Roca and Krainer 2009). The transfection mixture included 80 ng of EGFP-N1 (Clontech), 80 ng of the splicing minigene, and 840 ng of control (pcDNA3.1+ or pUC19) or suppressor U1 plasmid.

    RNA extraction, reverse transcription, and PCR

    We performed RT–PCR analyses as described (Roca and Krainer 2009). Briefly, 48 h post-transfection, we extracted total RNA from cells using TRIzol (Invitrogen). We treated the RNA with RQ1 DNase (Promega), and reverse-transcribed it using SuperScript II RT (Invitrogen) and oligo-dT. We amplified the resulting cDNAs by PCR using vector-specific primers. We 5′-end-labeled one of the PCR primers using T4 polynucleotide kinase (New England Biolabs) and γ-³²P-ATP. We performed 23 cycles of PCR, ensuring exponential amplification. We separated the PCR products by 6% native PAGE and quantified the band intensity by PhosphorImage analysis. We performed RT–PCRs from three independent transfections to derive the mean percentage of inclusion for each experiment. In all cases, the standard deviations were ≤5%, allowing comparison of exon inclusion percentage values between experiments. We determined the identity of each PCR product by subcloning agarose gel-purified bands with an Original TA Cloning kit (Invitrogen) followed by sequencing. We also directly sequenced the RT–PCR products in Figure 1D to test for potential splicing at a UU 5′ss dinucleotide, as reported for 5′ss similar to the atypical 5′ss (Hartmann et al. 2010).
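
    The mean percentage of inclusion and its standard deviation can be sketched as follows. This is an illustration only: it assumes that each lane contains just two products (exon included and exon skipped) and that, because a single end-labeled primer is used, band intensity is proportional to the molar amount of product. The band intensities and the function name `percent_inclusion` are hypothetical.

```python
from statistics import mean, stdev


def percent_inclusion(included: float, skipped: float) -> float:
    """Percentage of exon inclusion from two band intensities (arbitrary units)."""
    total = included + skipped
    if total <= 0:
        raise ValueError("band intensities must sum to a positive value")
    return 100.0 * included / total


# Hypothetical (included, skipped) intensities from three independent transfections.
replicates = [(820.0, 180.0), (790.0, 210.0), (805.0, 195.0)]
values = [percent_inclusion(inc, skp) for inc, skp in replicates]

# Sample standard deviation across the three transfections.
print(f"inclusion = {mean(values):.1f}% (SD {stdev(values):.1f}%)")
```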

    Tm analysis

    The RNA oligonucleotides were synthesized at IDT (Integrated DNA Technologies), purified by RNase-free HPLC, and confirmed by mass spectrometry. Before use, we checked the oligonucleotides by LC-MS under denaturing conditions (tributylamine in 70% [v/v] acetonitrile). We diluted the oligonucleotides in Tm buffer containing 100 mM NaCl, 10 mM NaPO4, and 0.1 mM EDTA (at pH 7). We mixed each oligonucleotide (5′ss and U1) at a final concentration of 8 μM, based on the absorbance at 85°C and extinction coefficients (Puglisi and Tinoco 1989). We heated each oligonucleotide pair for 5 min at 95°C and allowed it to cool for 2 h to room temperature. We measured the absorbance at 260 nm as a function of increasing temperature using a Cary 100 Bio UV-Visible spectrophotometer, heating each duplex at a rate of 0.5°C/min in 1-cm path length cells, followed by cooling to confirm reversibility and lack of evaporation. We obtained the Tm and ΔGexp from the absorbance versus temperature curves using CaryWinUV software (Agilent Technologies).
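
    For readers without access to the instrument software, a melting temperature can be approximated from an absorbance versus temperature curve by locating the maximum of the first derivative. This is a swapped-in, simplified technique and not necessarily the procedure implemented in CaryWinUV; for bimolecular duplexes, the derivative maximum only approximates the Tm. The synthetic curve, the 0.5°C sampling interval, and the function name `tm_first_derivative` are assumptions made for illustration.

```python
import math


def tm_first_derivative(temps, absorbances):
    """Return the temperature at which dA/dT (central differences) is largest."""
    if len(temps) != len(absorbances) or len(temps) < 3:
        raise ValueError("need at least three matched points")
    best_t, best_slope = None, float("-inf")
    for i in range(1, len(temps) - 1):
        slope = (absorbances[i + 1] - absorbances[i - 1]) / (temps[i + 1] - temps[i - 1])
        if slope > best_slope:
            best_t, best_slope = temps[i], slope
    return best_t


# Synthetic sigmoidal melting curve (hypothetical values), sampled every 0.5 degrees C.
temps = [20.0 + 0.5 * i for i in range(121)]  # 20 to 80 degrees C
curve = [0.50 + 0.10 / (1.0 + math.exp(-(t - 45.0) / 3.0)) for t in temps]

print(tm_first_derivative(temps, curve))
```

    Real melting curves are noisy, so smoothing or curve fitting would normally precede the derivative step.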

    In silico analyses

    We compiled an updated database of naturally occurring, well-annotated 5′ss in the human genome, spanning 30 nt with the exon–intron junction in the middle (nucleotides 15/16). We assembled this 5′ss collection using different databases, including SpliceRack (Sheth et al. 2006), dbCASE (Zhang et al. 2007), and RefSeq (Pruitt et al. 2009). We removed redundant sequences, noncanonical 5′ss, and U12-type 5′ss to obtain all U2-type GT-AG and GC-AG 5′ss.
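
    The 30-nt window layout described above can be sketched as follows. The example only extracts the window and checks the 5′ss dinucleotide at positions 16 and 17; it does not reproduce the database merging, the 3′ss check, or the removal of U12-type introns. The function names and the example sequence are hypothetical.

```python
def five_ss_window(seq: str, intron_start: int, flank: int = 15) -> str:
    """Return a 2*flank-nt window with the exon-intron junction in the middle.

    intron_start is the 0-based index of the first intron nucleotide (+1).
    """
    if intron_start - flank < 0 or intron_start + flank > len(seq):
        raise ValueError("not enough flanking sequence")
    return seq[intron_start - flank:intron_start + flank].upper()


def is_gt_or_gc(window: str) -> bool:
    """True if positions 16-17 (1-based) of a 30-nt window are GT or GC."""
    return window[15:17] in ("GT", "GC")


# Hypothetical sequence: 20 nt of exon followed by 20 nt of intron.
exon = "ACGT" * 5
intron = "GTAAGTATCTTCCAGGACTT"
window = five_ss_window(exon + intron, intron_start=len(exon))

print(window, len(window), is_gt_or_gc(window))
```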

    We used these 5′ss sequences and the 5′ end of U1 as inputs for the UNAFold hybrid tool (Markham and Zuker 2008), which calculated the most stable intermolecular base-pairing structure and minimum free energy for each 5′ss. UNAFold predictions are based on empirical thermodynamic parameters, known as nearest-neighbor energy rules (Mathews et al. 1999). These rules include energetic parameters for bulged nucleotides but not for Ψs. The first UNAFold run—without restrictions—returned the most stable structure and ΔG1 (in kilocalories per mole). From these predictions, we only considered those sequences with registers for 5′ss +1G (G16 on the 30-nt sequence) base-paired to C8 of U1 snRNA. We obtained the ΔG2 values for canonical base-pairing by forcing +1G of the 5′ss to base-pair to C8 of U1, and not allowing bulges, using the maximum asymmetry option (maxas = 0). The energetic advantage of the bulge over the canonical register was estimated as ΔΔG = ΔG1 − ΔG2. We selected a ΔΔG cutoff of ≤−1 to identify the bulge 5′ss in which the bulge structure confers a substantial advantage. This value was derived from the mean absolute difference between bulge 5′ss variants carrying SNPs, as in general, both 5′ss variants should be equally functional. Finally, we classified the various bulge 5′ss into different categories, depending on the length of the bulge and the position of the bulged nucleotides.
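
    Once the two free energies are available for each 5′ss, the ΔΔG calculation and the cutoff can be sketched as follows. The example does not run UNAFold; it assumes that ΔG1 (unrestricted) and ΔG2 (canonical register, no bulges) have already been computed. The class name, site names, and energy values are hypothetical.

```python
from dataclasses import dataclass

DDG_CUTOFF = -1.0  # kcal/mol, the cutoff stated in the text


@dataclass
class SpliceSiteEnergies:
    name: str
    dg1: float  # unrestricted minimum free energy (kcal/mol)
    dg2: float  # canonical register without bulges (kcal/mol)

    @property
    def ddg(self) -> float:
        """ddG = dG1 - dG2; negative when the unrestricted structure is more stable."""
        return self.dg1 - self.dg2

    @property
    def substantial_bulge_advantage(self) -> bool:
        return self.ddg <= DDG_CUTOFF


# Hypothetical energies for three 5'ss.
sites = [
    SpliceSiteEnergies("site_A", dg1=-7.4, dg2=-5.1),
    SpliceSiteEnergies("site_B", dg1=-6.0, dg2=-5.6),
    SpliceSiteEnergies("site_C", dg1=-8.2, dg2=-8.2),
]

for s in sites:
    print(f"{s.name}: ddG = {s.ddg:+.1f} kcal/mol, substantial = {s.substantial_bulge_advantage}")
```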

    We cross-analyzed the bulge 5′ss with updated data sets of disease-causing mutations and SNPs at human 5′ss that fall outside of the invariant dinucleotide at positions +1/+2 (Roca et al. 2008). We assessed the conservation of the bulge structure by running UNAFold for the 5′ss with either the mutated nucleotide or the SNP allele that is not in the reference genome sequence. For each disease-causing mutation, we derived the ΔG1 difference as ΔG1(wild-type) − ΔG1(mutant). For SNPs, because both alleles are presumably functional, we calculated the MAD (in negative numbers) as −|ΔG1(allele 1) − ΔG1(allele 2)|. We then calculated means and standard deviations.
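
    The two difference measures defined above can be sketched as follows, assuming that the ΔG1 values for both variants of each 5′ss are already available. All energy values and function names are hypothetical.

```python
from statistics import mean, stdev


def mutation_dg1_difference(dg1_wild_type: float, dg1_mutant: float) -> float:
    """dG1(wild-type) - dG1(mutant); negative when the mutation weakens base-pairing."""
    return dg1_wild_type - dg1_mutant


def snp_negative_abs_difference(dg1_allele1: float, dg1_allele2: float) -> float:
    """-|dG1(allele 1) - dG1(allele 2)|, independent of allele order."""
    return -abs(dg1_allele1 - dg1_allele2)


# Hypothetical (wild-type, mutant) and (allele 1, allele 2) dG1 pairs in kcal/mol.
mutations = [(-7.5, -4.5), (-6.0, -3.5), (-8.0, -6.5)]
snps = [(-6.5, -7.5), (-5.0, -4.5), (-7.0, -7.0)]

mut_diffs = [mutation_dg1_difference(wt, mt) for wt, mt in mutations]
snp_diffs = [snp_negative_abs_difference(a1, a2) for a1, a2 in snps]

print(f"mutations: mean = {mean(mut_diffs):.2f}, SD = {stdev(mut_diffs):.2f}")
print(f"SNPs:      mean = {mean(snp_diffs):.2f}, SD = {stdev(snp_diffs):.2f}")
```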

    Using the annotations in dbCASE (Zhang et al. 2007), we derived the frequency of each alternative splicing event for internal exons in the canonical 5′ss and bulge 5′ss data sets. We defined a “weak canonical 5′ss” data set as the subset of “canonical 5′ss” with ΔG ≥ −7 kcal/mol (49,762 5′ss). We obtained P-values using Fisher's exact test or one-sample t-test.
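
    An enrichment comparison of this kind can be sketched with a one-sided Fisher's exact test computed from the hypergeometric distribution. The text does not state whether the reported tests were one-sided or two-sided, so the sidedness here is an assumption, and the counts are hypothetical values chosen only to resemble the reported percentages. The function names are also hypothetical.

```python
from math import exp, lgamma


def _log_comb(n: int, k: int) -> float:
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def fisher_enrichment_p(a: int, b: int, c: int, d: int) -> float:
    """One-sided Fisher's exact P for enrichment in the 2x2 table [[a, b], [c, d]].

    a = bulge 5'ss in the event class, b = other bulge 5'ss,
    c = comparison 5'ss in the event class, d = other comparison 5'ss.
    """
    row1, col1, total = a + b, a + c, a + b + c + d
    log_denom = _log_comb(total, col1)
    p = 0.0
    # Sum hypergeometric probabilities for tables at least as extreme as observed.
    for k in range(a, min(row1, col1) + 1):
        p += exp(_log_comb(row1, k) + _log_comb(total - row1, col1 - k) - log_denom)
    return min(p, 1.0)


# Hypothetical counts: 42 of 1000 bulge 5'ss vs. 210 of 10,000 canonical 5'ss.
p_value = fisher_enrichment_p(42, 958, 210, 9790)
print(f"one-sided P = {p_value:.2e}")
```

    Working in log space keeps the binomial coefficients from overflowing with transcriptome-sized counts.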

    Acknowledgments

    We thank Zuo Zhang, Fred Allain, Pierre Barraud, David Horowitz, and Fedor Karginov for advice; Chaolin Zhang for providing the 5′ss data in dbCASE; and Michael Zuker for implementing the maximum asymmetry option in UNAFold. X.R., M.A., and A.R.K. acknowledge support from NIH grant GM42699.

    Footnotes

    • Received February 20, 2012.
    • Accepted April 10, 2012.

    References
