Rapid evolutionary turnover underlies conserved lncRNA–genome interactions

In this study, Quinn et. al. used an integrative strategy based on matching focal and repeated RNA secondary structures and other RNA features that uncovers novel lncRNA orthologs despite limited sequence similarity. This method was applied to Drosophilia roX1 and roX2 RNAs, and 47 new roX RNAs across ∼40 million years of evolution were discovered.

and the concestor of the willistoni-saltans clade. The D. melanogaster roX2 locus is shown for reference.
(B) Sequence identity between roX2 and roX3 in five species. The roX2 and roX3 orthologs share relatively high sequence identity both between species and between roX2-roX3 pairs, indicating that roX2 and roX3 are likely paralogs that resulted from a whole gene duplication event.
(C) roX2 and roX3 RNA expression. roX2 and roX3 orthologs are male-biased transcripts in the willistoni-saltans clade, as is roX2 in H. duncani. Expression bias between roX2 and roX3 is species-specific; for example, roX2 expression is nearly undetectable in D. paulistorum, yet roX2 expression is higher than roX3 in D.
nebulosa. GPDH is used as a sex-independent control.
(E) The stem-loop structure at the 5'-end of roX2 and roX3 orthologs is conserved between species and between roX2-roX3 paralogs. This structure contains RB1 (red circles). H. duncani roX2 has an extended stem-loop.
(F) ChIRP-seq was used to map the chromatin occupancy of the roX RNAs in D.
willistoni. The roX2-roX3 locus is shown with signal from roX1 and roX2/3 ChIRPseq and input DNA. The roX2 and roX3 loci exhibit a very similar pattern of roX binding, indicating that the HAS at these paralogous loci are also conserved. (A) In D. melanogaster, roX1 is flanked by protein-coding genes yin (upstream, sense) and ec (downstream, antisense). In nearly all other species, these flanking genes are maintained in position and orientation; one such exception is D. ananassae, in which an intrachromosomal rearrangement replaced the downstream neighboring gene ec with CG5254. For brevity, only representative species from major fly clades are shown. Scale bars, 5kb.
(B) Similarly, in D. melanogaster, roX2 is flanked by protein-coding genes CG11695 (upstream, sense), e(y)2 (upstream, antisense), and nod (downstream, sense). Synteny is not always perfectly preserved, however; though nod is downstream of roX2 in the melanogaster subgroup, ari-1 is downstream in all other flies outside the melanogaster subgroup. Thus, the roX2-ari-1 synteny block is ancestral, and an intrachromosomal shuffling event in the ancestor of the melanogaster subgroup replaced ari-1 with nod at the roX2 locus. For this reason, when searching for roX2 orthologs in species outside the melanogaster subgroup, we instead used ari-1 as a flanking gene instead of nod. Using nod in any search outside of the melanogaster subgroup would prove unproductive, since this syntenic relationship is not maintained. This instance highlights the importance of using phylogenetic relationships as a guide in the search strategy. When performing the synteny search in species lacking WGS, such as D.
nasuta, we anticipated that D. nasuta might have similar gene order to its closest relative with a sequenced genome, D. albomicans. We designed degenerate primers at evolutionarily conserved sites in the four flanking protein-coding genes (ec and yin, and e(y)2 and ari-1, for the roX1 and roX2 loci, respectively). In both cases, PCR yielded a fragment of roughly the expected intervening size, which we then sequenced.
Supplemental Figure S3. The lncRNA ortholog search strategy found the HOTAIR locus in 43 diverse vertebrate species.
(A) HOTAIR is a lncRNA that was discovered in human and has been described in mouse. It is transcribed from the HOXC locus, flanked by and antisense to the protein-! 3 coding genes HOXC11 and HOXC12, which are highly conserved across vertebrate genomes.
(B) We searched for the HOTAIR locus in 43 diverse vertebrate species from primates down to zebrafish (which diverged ~400Mya). We used the synteny module of the lncRNA ortholog search strategy, initiating with knowledge of only human HOTAIR.
We found HOXC11 and HOXC12 on the same genomic scaffold in a window of ~21kb, suggesting that the syntenic relationship with HOTAIR is maintained. In at least six species for which there are expressed sequencing tags (ESTs), there was an EST in the intergenic space between HOXC11 and HOXC12, mapping to the location where HOTAIR would be expected. This suggests that an intergenic lncRNA -presumably the HOTAIR ortholog -is encoded at this locus. K, known HOTAIR lncRNA; Y (cyan), HOTAIR lncRNA ortholog candidate identified.
(C) Using the motif discovery algorithm MEME, we searched for instances of microhomology in the putative HOTAIR loci. At least six instances of focal microhomology were found, represented as colored boxes and corresponding to specific positions mapped to human HOTAIR. All six elements (red through purple) are conserved across all eutherian mammals, while one (yellow) has much deeper evolutionary conservation and can be found in zebrafish. Despite monotremes' closer relation to eutherian and metatherian mammals, many of the sequence elements found in HOTAIR are absent. Pale boxes with dotted outline indicate species for which not all elements within a microhomologous sequence motif are present (i.e. incomplete microhomology).
(D) Sequence motifs for the conserved HOTAIR elements. One motif (red) is found at the promoter for the human HOTAIR lncRNA, and its conservation suggests that this promoter is conserved in eutherian and metatherian mammals. Similarly, a splice site (cyan) is conserved in eutherian and metatherian mammals. Together, the conservation of these transcription-and splicing-associated signals suggests that this locus in other species is also transcribed and spliced. Whether these are microhomologous elements are functional at the DNA level (e.g. transcription factor binding sites, enhancers, etc.) or at the RNA level (RNA-binding protein sites, RNA processing sites, microRNA targets, etc.) and their importance to HOTAIR function remains to be validated. (A) We performed 5'-and 3'-RACE (rapid amplification of cDNA ends) on a selection of 16 diverse roX2 orthologs to define their gene structure, including transcriptional start sites (TSS; green flags), alternative splicing (gray boxes and dashed lines), and polyadenylation sites (PAS; red arrowheads). These 16 species were selected for their diversity and representation of major fly clades. D. melanogaster roX2 is known to have numerous alternative splice forms (Park et al. 2005). The major isoform of D.
(B) RACE showed that all orthologs analyzed share a similar gene structure: a major isoform consisting of exon 1-exon 3 with minor isoforms from alternative splicing of "exon 2". TSS within the short first exon vary, as do PAS within exon 3, occurring most commonly 3' of roXbox-4, -5, or -6. The relative positions, number, and orientation of roXboxes (RB, red blocks) and inverted roXboxes (IRB, cyan blocks) are consistent across most species. Here, graphical alignment is relative to RB5, the most highly conserved of these sequences.
(C) Indeed, the RBs and IRB are the most highly conserved sequence elements in roX2 orthologs. The 8-nucleotide RB or IRB motif is underlined. Motifs were calculated over all discovered roX2 (and roX3) orthologs. Other weakly conserved sequences occur within "exon 2" and are likely implicated in alternative splicing (not shown). RB (red blocks) and IRB (cyan blocks) fold into stem-loops (colored arcs; circle plot).

Supplemental
The P1 stem-loop (red arcs) and RB4-IRB-RB5 alternative structures (cyan, blue, and purple) are the most conserved structures; see also Supplemental Fig. S6. For brevity, only representative species from major clades are shown. Figure S8. Within each species, roX1 and roX2 have the same binding sites, though with different absolute affinities. In all species, roX1 and roX2 bind to the same loci, though some are biased towards roX2. The signal from roX1 or roX2

Supplemental
ChIRP-seq in 1kb windows of the X-chromosome (ME-A in all species, plus ME-D in D. willistoni) was integrated, and plotted against one another. There is high correlation between roX1 and roX2 signal, especially for D. melanogaster, D. willistoni, and D.
virilis with greatly diminished correlation in D. busckii. The roX1 signal is substantially lower than roX2 signal for all species except D. melanogaster, indicating a bias towards roX2 as the dominant roX homolog in these species (see Fig. 3C; note that x-and y-axes are not equally scaled). The 5kb windows surrounding the roX1 and roX2 loci were excluded due to direct ChIRP oligo-genomic DNA hybridization and recovery. Figure S9. roX RNAs bind to dozens of autosomal sites, some of which are conserved X-linked HAS in D. willistoni.

Supplemental
(A-B) Two such HAS at the TSS / promoters of autosomal genes (on ME-D) are shown, RasGAP1 and Sox21b. These genes are autosomal in D. melanogaster (on ME-D), but X-linked in D. willistoni (due to its ME-A+D fusion). The HAS at the TSS / promoters of these two genes are conserved between D. melanogaster and D. willistoni; however, in D. willistoni an additional HAS is immediately downstream from the RasGAP1 stop codon (presumably in the 3'UTR). This suggests that preexisting autosomal binding sites may also serve as HAS after neo-sex chromosome karyotype fusions.
(C) The autosomal HAS in D. melanogaster are present on all autosomes (ME-B through -F), and are predominantly at TSS / promoters (as opposed to the intronic and 3'-UTR bias of X-linked HAS). These autosomal HAS are reproducible across roX1 and roX2 ! 7 ChIRP-seq and between different ChIRP-seq experiments in different cell types (Quinn et al. 2014). Interestingly, some of these genes have male-specific or malebiased expression, such as chinmo, Sox21b, and dac (not shown), suggesting that male-biased autosomal genes may coopt the dosage compensation complex to upregulate expression in a male-specific manner.
Supplemental Figure S10. The MRE motif is centered at roX2 ChIRP-seq peaks (HAS). The best-matched MRE motif within each roX2 ChIRP-seq peak (HAS) was calculated by CentriMo (Bailey et al. 2009) and plotted. In each species, the MRE motif is significantly centered, indicating the precision and high-fidelity with which ChIRP-seq can map precise roX binding sites. Figure S11. HAS on D. willistoni ME-D are similar to HAS on ME-A, despite its ancestry as an autosome.

Supplemental
(A) The MRE motifs from ME-A and ME-D are indistinguishable. We did not find evidence of tamed transposable elements at HAS on ME-D, as found on the more recently evolved neo-sex chromosome of the D. miranda (Ellison and Bachtrog 2013).
(B) ME-A and ME-D HAS are similarly distributed on genomic regions, and are enriched especially in introns. Note that UTRs are grouped with intergenic regions here.
(C) Intronic HAS are significantly biased towards the reverse-complement orientation of the MRE motif (CT-repeat) on both ME-A and ME-D.
(D) HAS are significantly proximal to PPT on both ME-A and ME-D. Figure S12. Overlap and proximity between homologous roX binding sites.

Supplemental
(A) Species-to-species liftover and HAS distance calculation strategy. Homologous regions in two species' genomes were mapped by genome-wide liftover. If homologous sites are both HAS, the HAS are overlapping and the distance between is 0. If a homologous site is a HAS in one species and not another, the distance to the nearest HAS is calculated (d>0). If the HAS has no nearby neighbor in the other species, the distance is much larger. Overlap and proximity between homologous HAS using the above strategy for all pairwise species comparisons. Though exact conservation of binding sites (i.e. distance = 0) between any two species is limited (approximately 10-30%), this is significantly higher than expected by random chance, as a random permutation of all HAS over their respective chromosomes or the whole genome yields very few overlapping peaks. Additionally, if exact peak overlap is lost (i.e. d>0), there is a high likelihood that another peak is nearby in the homologous genomic region, as indicated by the steep slope of the "observed" line in this distance regime.         intergenic  intron  TSS  intergenic  TSS  TSS  intron  TSS  TSS  intron  TSS  TSS  TSS  intron  TSS  TSS  intergenic  TSS  TSS  TSS  TSS  intron  intron  TSS  TSS p-value<2.2e-14 p-value=4.2e-5 p-value<1.8e-11 p-value=2.7e-4 p-value=1.7e-7 Quinn_Fig.S14