A common class of transcripts with 5′-intron depletion, distinct early coding sequence features, and N1-methyladenosine modification

Introns are found in 5′ untranslated regions (5′UTRs) for 35% of all human transcripts. These 5′UTR introns are not randomly distributed: Genes that encode secreted, membrane-bound and mitochondrial proteins are less likely to have them. Curiously, transcripts lacking 5′UTR introns tend to harbor specific RNA sequence elements in their early coding regions. To model and understand the connection between coding-region sequence and 5′UTR intron status, we developed a classifier that can predict 5′UTR intron status with >80% accuracy using only sequence features in the early coding region. Thus, the classifier identifies transcripts with 5′ proximal-intron-minus-like-coding regions (“5IM” transcripts). Unexpectedly, we found that the early coding sequence features defining 5IM transcripts are widespread, appearing in 21% of all human RefSeq transcripts. The 5IM class of transcripts is enriched for non-AUG start codons, more extensive secondary structure both preceding the start codon and near the 5′ cap, greater dependence on eIF4E for translation, and association with ER-proximal ribosomes. 5IM transcripts are bound by the exon junction complex (EJC) at noncanonical 5′ proximal positions. Finally, N1-methyladenosines are specifically enriched in the early coding regions of 5IM transcripts. Taken together, our analyses point to the existence of a distinct 5IM class comprising ∼20% of human transcripts. This class is defined by depletion of 5′ proximal introns, presence of specific RNA sequence features associated with low translation efficiency, N1-methyladenosines in the early coding region, and enrichment for noncanonical binding by the EJC.

A classifier that predicts 5UI status using only early coding sequence information. 1 To better understand the previously-reported enigmatic relationship between certain early 2 coding region sequences and the absence of a 5UI, we sought to model this relationship . 3 Specifically, we used a random forest classifier (Breiman 2001) to learn the relationship 4 between 5UI absence and a collection of 36 different nucleotide-level features extracted from 5 the first 99 nucleotides of all human coding regions (CDS) (Figs 1A-C; Table S1; Methods). 6 We then used all transcripts known to contain an SSCR (a total of 3743 transcripts clusters; 7 Methods), regardless of 5UI status, as our training set. This training constraint ensured that all 8 input nucleotide sequences were subject to similar functional constraints at the protein level. 9 Thus, we sought to identify sequence features that differ between 5UIand 5UI + transcripts at 10 the RNA level. 11 Our classifier assigns to each transcript a "5'UTR-intron-minus-predictor" (5IMP) score 12 between 0 and 10, where higher scores correspond to a higher likelihood of being 5UI - (Fig  13   1C). Interestingly, preliminary ranking of the 5UItranscripts by 5IMP score revealed a 14 relationship between the position of the first intron in the coding region and the 5IMP score. 15 5UItranscripts for which the first intron was more than 85 nts downstream of the start codon 16 had the highest 5IMP scores. Furthermore, the closer the first intron was to the start codon, 17 the lower the 5IMP score (Fig 1D). We explored this relationship further by training classifiers 18 that increasingly excluded from the training set 5UItranscripts according to the distance of 19 To minimize the impact of transcripts that may 'behave' as though they were 5UI + due 1 to an intron early in the coding region, we eliminated 5UI -SSCR transcripts with a first intron 2 <90 nts downstream of the start codon (Methods) and generated a new classifier. 3 Discriminative motif features were learned independently (Methods), and performance of this 4 scores were strongly skewed towards high 5IMP scores (peak at ~9; Fig 2E). These results 1 were consistent with the idea that SignalP + transcripts do contain many unannotated SSCRs. 2 Having considered SignalP + transcripts as well as SSCR-and MSCR-containing 3 transcripts, we used the classifier to calculate 5IMP scores for all remaining "S -/M -/SignalP -" 4 transcripts. Although the performance was weaker on this gene set, it was still better than 5 expected of a naive predictor (Fig 2A, green line). 5UI + S -/M -/SignalPtranscripts were 6 strongly skewed toward low 5IMP scores (Fig 2F). Surprisingly, however, a significant 7 fraction of 5UI -S -/M -/SignalPtranscripts had high 5IMP scores (~18%). Thus, our results 8 suggest a broad class of transcripts with early coding regions carrying sequence signals that 9 predict the absence of a 5'proximal intron, or in other words, a class of transcripts with 10 5'proximal-intron-minus-like coding regions. Hereafter we refer to transcripts in this class as 11 "5IM" transcripts. 12 We sought to identify what fraction of transcripts have 5IMP scores that exceed what 13 would be expected in the absence of 5UI --predictive early coding region signals. To establish 14 this expectation, we used the above-described negative control set of equal-length coding 15 sequences from outside of the early coding region. By quantifying the excess of high-scoring 16 sequences in the real distribution relative to this control distribution, we estimate that 21% of 17 all human transcripts are 5IM transcripts (a 5IMP score of 7.41 corresponds to a 5% False 18 Discovery Rate; Fig 2G). The set of 5IM transcripts defined by our classifier (Table S2) 19 includes many that do not encode ER-targeted or mitochondrial proteins. The distribution of 20 various classes of mRNAs among the 5IM transcripts was: 38% ER-targeted (SSCR or SignalP + ), 21 9% mitochondrial (MSCR) and 53% other classes (S -/M -/SignalP -) ( Figure 2G). These results 22 suggest that RNA-level features prevalent in the early coding regions of 5UI -SSCR and MSCR 23 . CC-BY-ND 4.0 International license under a not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available The copyright holder for this preprint (which was this version posted September 6, 2016. ; https://doi.org/10.1101/057455 doi: bioRxiv preprint transcripts are also found in other transcript types (Figs 2F-G), and that 5IM transcripts 1 represent a broad class. 2 3 Functional characterization of 5IM transcripts 4 5IM transcripts are defined by mRNA sequence features. Hence, we hypothesized that 5IM 5 transcripts may be functionally related through shared regulatory mechanisms mediated by 6 the presence of these common features. To this end, we collected large-scale datasets 7 representing diverse attributes covering six broad categories (see Table S3 for a complete list): 8 (1) Curated functional annotations --e.g., Gene Ontology terms, annotation as a 'housekeeping ' 9 gene, genes subject to RNA editing; (2) RNA localization --e.g., to dendrites, to mitochondria; 10 (3) Protein and mRNA half-life, ribosome occupancy and features that decrease stability of one 11 or more mRNA isoforms --e.g., AU-rich elements (4) Sequence features associated with 12 regulated translation --e.g., codon optimality, secondary structure near the start codon; (5) 13 Known interactions with RNA-binding proteins or complexes such as Staufen-1, TDP-43, or the 14 Exon Junction Complex (EJC) (6) RNA modifications -i.e., N 1 -methyladenosine (m 1 A). 15 We adjusted for multiple hypotheses testing at two levels. First, we took a conservative 16 approach (Bonferroni correction) to correct for the number of tested functional 17 characteristics. Second, some of the functional categories were analyzed in more depth and 18 multiple sub-hypotheses were tested within the given category. In this in-depth analysis a false 19 discovery-based correction was adopted. Below, all reported p-values remain significant (p-20 adjusted < 0.05) after multiple hypothesis test correction. 21 . CC-BY-ND 4.0 International license under a not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available The copyright holder for this preprint (which was this version posted September 6, 2016. ; https://doi.org/10.1101/057455 doi: bioRxiv preprint No associations between 5IM transcripts and features in categories (1), (2) and (3)  CC-BY-ND 4.0 International license under a not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available The copyright holder for this preprint (which was this version posted September 6, 2016. ; https://doi.org/10.1101/057455 doi: bioRxiv preprint transcripts with strong 5'UTR secondary structures (Parsyan et al. 2011). The eIF4E subunit 1 binds to the 7mGpppG 'methyl-G' cap and the ATP-dependent helicase eIF4A (scaffolded by 2 eIF4G) destabilizes 5'UTR secondary structure (Marintchev et al. 2009). A previous study 3 identified transcripts that were more actively translated under conditions that promote cap-4 dependent translation (overexpression of eIF4E) (Larsson et al. 2007). In agreement with the 5 observation that 5IM transcripts have more secondary structure upstream of the start codon 6 and near the 5'cap, transcripts with high 5IMP scores were more likely to be translationally, 7 but not transcriptionally, upregulated upon eIF4E overexpression (Fig 3C; Wilcoxon Rank Sum 8 Test p = 2.05e-22, and p = 0.28, respectively). 9 (IV) Non-AUG start codons. Transcripts with non-AUG start codons also have 10 intrinsically low translation initiation efficiencies (Hinnebusch and Lorsch 2012). These 11 mRNAs were greatly enriched among transcripts with high 5IMP scores (Fisher's Exact Test p 12 = 0.0003; odds ratio = 3.9) and have a median 5IMP score that is 3.57 higher than those with 13 an AUG start (Fig 3D).  Gingold and Pilpel 19 2011). We therefore examined the tRNA adaptation index (tAI), which correlates with copy 20 numbers of tRNA genes matching a given codon (dos Reis et al. 2004). Specifically, we 21 calculated the median tAI of the first 99 coding nucleotides of each transcript, and found that 22 5IMP score was negatively correlated with tAI (Fig S1A; Spearman Correlation rho= -0.23; p < 23 . CC-BY-ND 4.0 International license under a not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available The copyright holder for this preprint (which was this version posted September 6, 2016. ; https://doi.org/10.1101/057455 doi: bioRxiv preprint 2e-16 median tAI and 5IMP score). This effect was restricted to the early coding regions as the 1 negative control set of randomly chosen sequences downstream of the 3 rd exon from each 2 transcript did not exhibit a relationship between 5IMP score and codon optimality (Fig S1B-C). 3 Thus, 5IM transcripts show reduced codon optimality in early coding regions, suggesting that 4 5IM transcripts have decreased translation elongation efficiency. 5 To more precisely determine where the codon optimality phenomenon occurs within 6 the entire early coding region, we grouped transcripts by 5IMP score. For each group, we 7 calculated the mean tAI at codons 2-33 (i.e., nts 4-99). Across this entire region, 5IM 8 transcripts (5IMP >7.41; 5% FDR) had significantly lower tAI values at every codon except 9 codons 24 and 32 (Fig 3E; Wilcoxon Rank Sum test Holm-adjusted p < 0.05 for all 10 comparisons). To eliminate potential confounding variables, including nucleotide composition, 11 we performed several additional control analyses (Methods); none of these altered the basic 12 conclusion that 5IM transcripts have lower codon optimality than non-5IM transcripts across 13 the entire early coding region. 14 (VI) Ribosomes per mRNA. Finally, we examined the relationship between 5IMP score 15 and translation efficiency, as measured by the steady-state number of ribosomes per mRNA 16 molecule. To this end, we used a large dataset of ribosome profiling and RNA-Seq experiments 17 from human lymphoblastoid cell lines (Cenik et al. 2015). From this, we calculated the average 18 number of ribosomes on each transcript and identified transcripts with high or low ribosome 19 occupancy (respectively defined by occupancy at least one standard deviation above or below 20 the mean; see Methods). 5IM transcripts were slightly but significantly depleted in the high 21 ribosome-occupancy category (Fig 3F; Fisher's Exact Test p = 0.0006, odds ratio = 1.3). 22 . CC-BY-ND 4.0 International license under a not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available The copyright holder for this preprint (which was this version posted September 6, 2016. ; https://doi.org/10.1101/057455 doi: bioRxiv preprint Moreover, 5IMP scores exhibited a weak but significant negative correlation with the number 1 of ribosomes per mRNA molecule (Spearman rho=-0.11; p = 5.98e-23). 2 Taken together, all of the above results reveal that 5IM transcripts have sequence 3 features associated with lower translation efficiency, at the stages of both translation initiation 4 and elongation. 5 6 Non-ER trafficked 5IM transcripts are enriched in ER-proximal ribosome occupancy 7 We next investigated the relationship between 5IMP score and the localization of translation 8 within cells. Exploring the subcellular localization of translation at a transcriptome-scale 9 remains a significant challenge. Yet, a recent study described proximity-specific ribosome 10 profiling to identify mRNAs occupied by ER-proximal ribosomes in both yeast and human cells 11 (Jan et al. 2014). In this method, ribosomes are biotinylated based on their proximity to a 12 marker protein such as Sec61, which localizes to the ER membrane (Jan et al. 2014). For each 13 transcript, the enrichment for biotinylated ribosome occupancy yields a measure of ER-14 proximity of translated mRNAs. 15 We reanalyzed this dataset to explore the relationship between 5IMP scores and ER-16 proximal ribosome occupancy in HEK-293 cells. As expected, transcripts that exhibit the 17 highest enrichment for ER-proximal ribosomes were SSCR-containing transcripts and 18 transcripts with other ER-targeting signals. Yet, we noticed a surprising positive correlation 19 between ER-proximal ribosome occupancy and 5IM transcripts with no ER-targeting evidence 20 (Fig 4). This relationship was true for both mitochondrial genes (Fig 4; Spearman rho=0.43; p 21 < 2.2e-16), and genes with no evidence for either ER-or mitochondrial-targeting (Fig 4;  22 . CC-BY-ND 4.0 International license under a not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available The copyright holder for this preprint (which was this version posted September 6, 2016. ; https://doi.org/10.1101/057455 doi: bioRxiv preprint Spearman rho=-0.36; p < 2.2e-16). These results suggest that 5IM transcripts are more likely 1 than non-5IM transcripts to engage with ER-proximal ribosomes.  These 'non-canonical' EJC occupancy sites (ncEJC sites) were associated with multiple 19 sequence motifs, three of which were similar to known recognition motifs for SR proteins that 20 co-purified with the EJC core subunits (Singh et al. 2012). Interestingly, another motif (Fig 5B;  21 top) that was specifically found in first exons is not known to be bound by any known RNA-22 binding protein (Singh et al. 2012). This motif was CG-rich, a sequence feature that also 23 . CC-BY-ND 4.0 International license under a not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available The copyright holder for this preprint (which was this version posted September 6, 2016. ; https://doi.org/10.1101/057455 doi: bioRxiv preprint defines 5IM transcripts. This similarity presages the possibility of enrichment of first exon 1 ncEJC sites among 5IM transcripts. 2 Position analysis of called EJC peaks revealed that while only 9% of cEJCs reside in first 3 exons, 19% of all ncEJCs are found there. When we investigated the relationship between 5IMP 4 scores and ncEJCs in early coding regions, we found a striking correspondence-the median 5 5IMP score was highest for transcripts with the greatest number of ncEJCs (Fig 5C; Wilcoxon 6 Rank Sum Test; p < 0.0001). When we repeated this analysis by conditioning on 5UI status, we 7 similarly found that ncEJCs were enriched among transcripts with high 5IMP scores regardless 8 of 5UI status (Fisher's Exact Test, p < 3.16e-14, odds ratio > 2.3; Fig 5D). These results suggest 9 that the striking enrichment of ncEJC peaks in early coding regions was generally applicable to 10 all transcripts with high 5IMP scores regardless of 5UI presence. 11

Transcripts harboring N 1 -methyladenosine (m 1 A) have high 5IMP scores 12
It is increasingly clear that ribonucleotide base modifications in mRNAs are highly prevalent 13 and can be a mechanism for post-transcriptional regulation ( not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available The copyright holder for this preprint (which was this version posted September 6, 2016. ; https://doi.org/10.1101/057455 doi: bioRxiv preprint position of the first intron, we investigated the relationship between 5IMP score and m 1 A RNA 1 modification marks. 2 We analyzed the union of previously identified m 1 A modifications (Dominissini et al.  3 2016) across all cell types and conditions. Although there is some evidence that these marks 4 depend on cell type and growth condition, it is difficult to be confident of the cell type and 5 condition-dependence of any particular mark given experimental variation (see Methods). 6 Nevertheless, we found that mRNAs with m 1 A modification early in the coding region (first 99 7 nucleotides) had substantially higher 5IMP scores than mRNAs lacking these marks (Fig 6A;  8 Wilcoxon Rank Sum Test p = 3.4e-265), and were greatly enriched among 5IM transcripts ( Fig  9   6A; Fisher's Exact Test p = 1.6e-177; odds ratio = 3.8). In other words, the sequence features 10 within the early coding region that define 5IMP transcripts also associate with m 1 A 11 modification in the early coding region. 12 We next wondered whether 5IMP score was related to m 1 A modification generally, or 13 only associated with m 1 A modification in the early coding region. Indeed, many of the 14 previously identified m 1 A peaks were within the 5'UTRs of mRNAs (Li et al. 2016). 15 Interestingly, 5IMP scores were only associated with m 1 A modification in the early coding 16 region, and not with m 1 A modification in the 5'UTR (Fig 6B). This offers the intriguing 17 possibility that the sequence features that define 5IMP transcripts are co-localized with m 1 A 18 modification. 19 20

Discussion: 21
. CC-BY-ND 4.0 International license under a not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available The copyright holder for this preprint (which was this version posted September 6, 2016. ; https://doi.org/10.1101/057455 doi: bioRxiv preprint Coordinating the expression of functionally related transcripts can be achieved by post-1 transcriptional processes such as splicing, RNA export, RNA localization or translation ( Moore 2 and Proudfoot 2009). Sets of mRNAs subject to a common regulatory transcriptional process 3 can exhibit common sequence features that define them to be a class. For example, transcripts 4 subject to regulation by particular miRNAs tend to share certain sequences in their 3'UTRs that 5 are complementary to the regulatory miRNA (Ameres and Zamore 2013). Similarly, transcripts 6 that share a 5' terminal oligopyrimidine tract are coordinately regulated by mTOR and 7 ribosomal protein S6 kinase (Meyuhas 2000). Here we quantitatively define '5IM' transcripts 8 as a class that shares common sequence elements and functional properties. We estimate the 9 5IM class to comprise 21% of all human transcripts. 10 Whereas 35% of human transcripts have one or more 5'UTR introns, the majority of 11 5IM transcripts have neither a 5'UTR intron nor an intron in the first 90 nts of the ORF. Other 12 shared features of 5IM transcripts include sequence features associated with low translation 13 initiation rates. These are: (1) a tendency for RNA secondary structure in the region 14 immediately preceding the start codon (Fig 3A), and near the 5'cap ( Fig 3B); (2) translational 15 upregulation upon overexpression of eIF4E ( Fig 3C); and (3) more frequent use of non-AUG 16 start codons (Fig 3D). Also consistent with low intrinsic translation efficiencies, 5IM 17 transcripts additionally tend to depend on less abundant tRNAs to decode the beginning of the 18 open reading frame (Fig 3E). 19 We had previously reported that transcripts encoding proteins with ER-and 20 mitochondrial-targeting signal sequences (SSCRs and MSCRs, respectively) are over-21 represented among the 65% of transcripts lacking 5'UTR introns (Cenik et al. 2011). 22 Transcripts in this set are enriched for the sequence features detected by our 5IM classifier. By 23 . CC-BY-ND 4.0 International license under a not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available The copyright holder for this preprint (which was this version posted September 6, 2016. ; https://doi.org/10.1101/057455 doi: bioRxiv preprint examining these enriched sequence features, we showed that the 5IM class extends beyond 1 mRNAs encoding membrane proteins. Jan et al. recently developed a transcriptome-scale 2 method to identify mRNAs occupied by ER-proximal ribosomes in both yeast and human cells 3 (Jan et al. 2014). As expected, transcripts known to encode ER-trafficked proteins were highly 4 enriched for ER-proximal ribosome occupancy. However, their data also showed many 5 transcripts encoding non-ER trafficked proteins to also be engaged with ER-proximal 6 ribosomes (Reid and Nicchitta 2015b). Similarly, several other studies have suggested a critical 7 role of ER-proximal ribosomes in translating several cytoplasmic proteins (Reid and Nicchitta 8 2015a). Here, we found that 5IM transcripts including those that are not ER-trafficked or 9 mitochondrial were significantly more likely to exhibit binding to ER-proximal ribosomes 10 (Figs 4A-B). 11 In addition to ribosomes directly resident on the ER, an interesting possibility is the 12 presence of a pool of peri-ER ribosomes (Jan et al. 2015; Reid and Nicchitta 2015a). 13 Association of 5IM transcripts with such a peri-ER ribosome pool could potentially explain the 14 observed correlation of 5IM status with binding to ER-proximal ribosomes. The ER is 15 physically proximal to mitochondria (Rowland and Voeltz 2012), so peri-ER ribosomes may 16 include those translating mRNAs on mitochondria (i.e., mRNAs with MSCRs) (Sylvestre et al. 17 2003). However, even when transcripts corresponding to ER-trafficked and mitochondrial 18 proteins were excluded from consideration, ER-proximal ribosome enrichment and 5IMP 19 scores were highly correlated (Fig 4A). Thus another shared feature of 5IM transcripts is their 20 translation on or near the ER regardless of the ultimate destination of the encoded protein . 21 In an attempt to identify a common factor binding 5IM transcripts, we asked whether 22 5IM transcripts were enriched among the experimentally identified targets of 23 different 23 . CC-BY-ND 4.0 International license under a not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available The copyright holder for this preprint (which was this version posted September 6, 2016. ; https://doi.org/10.1101/057455 doi: bioRxiv preprint RBPs. Only one RBP emerged--the exon junction complex (EJC). Specifically, we observed a 1 dramatic enrichment of non-canonical EJC (ncEJC) binding sites within the early coding region 2 of 5IM transcripts. Further, the CG-rich motif identified for ncEJCs in first exons is strikingly 3 similar to the CG-rich motif enriched in the first exons of 5IM transcripts (Fig 5B). Previous 4 work implicated RanBP2, a protein associated with the cytoplasmic face of the nuclear pore, as 5 a binding factor for some SSCRs ). This finding suggests that nuclear 6 pore proteins may influence EJC occupancy on these transcripts. junctions and involves direct contact between the sugar-phosphate backbone and the EJC core 12 anchoring protein eIF4AIII, ncEJC binding sites likely reflect stable engagement between the 13 EJC core and other mRNP proteins (e.g., SR proteins) recognizing nearby sequence motifs. 14 Although some RBPs were identified for ncEJC motifs found in internal exons (Singh et al. 15 2014(Singh et al. 15 , 2012, to date no candidate RBP has been identified for the CG-rich ncEJC motif found in 16 the first exon. If this motif does result from an RBP interaction, it is likely to be one or more of 17 the ~70 proteins that stably and specifically bind to the EJC core (Singh et al. 2012). 18 Finally, we observed a dramatic enrichment for m 1 A modifications among 5IM 19 transcripts, with specific enrichment for m 1 A modifications in the early coding region. Given 20 this striking enrichment perhaps it is perhaps not surprising that m 1 A containing mRNAs were 21 also shown to have more structured 5'UTRs that are GC-rich compared to m 1 A lacking mRNAs 22 (Dominissini et al. 2016). Similar to 5IM transcripts, m 1 A containing mRNAs were found to 23 . CC-BY-ND 4.0 International license under a not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available The copyright holder for this preprint (which was this version posted September 6, 2016. ; https://doi.org/10.1101/057455 doi: bioRxiv preprint decorate start codons that appear in a highly structured context. While ALKBH3 has been 1 identified as a protein that can demethylate m 1 A, it is currently unknown whether there are 2 any proteins that can specifically act as "readers" of m the intriguing hypothesis that one or more of the ~70 proteins that stably and specifically bind 7 to the EJC core can function as an m 1 A reader. Future work involving directed experiments 8 would be needed to test this hypothesis. 9 Given that 5IM transcripts are enriched for ER-targeted and mitochondrial proteins, it 10 is plausible that the observed functional characteristics of 5IM transcripts are driven solely by 11 SSCR and MSCR-containing transcripts. Hence, we repeated all analyses for the subclasses of 12 5IM transcripts (MSCR-containing, SSCR-containing, S -/M -/SignalP + , or S -/M -/SignalP -). We 13 found the observed associations remained statistically significant and had the same direction 14 of effect, even after eliminating SSCR-and MSCR-containing transcripts, despite the fact that all 15 training of the 5IM classifier was performed only using SSCR transcripts. We also found that 16 5IMP score was equally or more strongly associated with each of the functional characteristics 17 compared to the 5UI status. In conclusion, the molecular associations we report apply to 5IM 18 transcripts as a whole, and are not driven solely by the subset of 5IM transcripts encoding ER-19 or mitochondria-targeting signal peptides, and seem to indicate shared features beyond simple 20 lack of a 5'UTR intron. 21 An intriguing possibility is that 5IM transcript features associated with lower intrinsic 22 translation efficiency may together enable greater 'tunability' of 5IM transcripts at the 23 . CC-BY-ND 4.0 International license under a not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available The copyright holder for this preprint (which was this version posted September 6, 2016. CC-BY-ND 4.0 International license under a not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available  Pruitt 4 et al. 2005). Transcripts with fewer than three coding exons, and transcripts where the first 99 5 coding nucleotides straddle more than two exons were removed from further consideration. 6 The criteria for exclusion of genes with fewer than three coding exons was to ensure that the 7 analysis of downstream regions was possible for all genes that were used in our analysis of 8 early coding regions. Specifically, the downstream regions were selected randomly from 9 downstream of the 3 rd exons. Hence, genes with fewer exons would not be able to contribute a 10 downstream region potentially creating a skew in representation. In total there were ~3000 11 genes that were removed from consideration due to this filter. Therefore, our classifier is 12 limited in its ability to assess transcripts from these genes. However, the performance 13 measures reported in our manuscript are robust to exclusion of these genes, in the sense that 14 the same class of transcripts was used in both training and test datasets. For each transcript cluster, we also constructed matched control sequences. Control 9 sequences were derived from a single randomly chosen in-frame 'window' downstream of the 10 3rd exon from the evaluated transcripts. If an evaluated transcript had fewer than 99 11 nucleotides downstream of the 3 rd exon, no control sequence was extracted. 5UI labels and 12 transcript clustering for the control sequences were inherited from the evaluated transcript. 13 The rationale for this decision is that our analysis depends on the position of the first intron; 14 hence genes with fewer than two exons need to be excluded, as these will not have introns. We 15 further required the matched control sequences to fall downstream of the early coding region. 16 In the vast majority of cases the third exon fell outside the first 99 nucleotides of the coding 17 region, making this a convenient criterion by which to choose control regions. 18

Sequence Features and Motif Discovery 19
36 sequence features were extracted from each transcript ( Table S1). The sequence 20 features included the ratio of arginines to lysines, the ratio of leucines to isoleucines, adenine 21 content, length of the longest stretch without adenines, preference against codons that contain 22 adenines or thymines. These features were previously found to be enriched in SSCR-23 . CC-BY-ND 4.0 International license under a not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available The copyright holder for this preprint (which was this version posted September 6, 2016. ; https://doi.org/10.1101/057455 doi: bioRxiv preprint containing and certain 5UItranscripts (Cenik et al. 2011;Palazzo et al. 2007). In addition, we 1 extracted ratios between several other amino acid pairs based on having 2 biochemical/evolutionary similarity, i.e. having positive scores, according to the BLOSUM62 3 matrix (Henikoff and Henikoff 1992). To avoid extreme ratios given the relatively short 4 sequence length, pseudo-counts were added to amino acid ratios using their respective 5 genome-wide prevalence. 6 In addition, we used three published motif finding algorithms (AlignACE, DEME, and In total, six motifs were discovered using the three motif finding algorithms (Table S1). 17 Position specific scoring matrices for all motifs were used to score the first 99 -l positions in 18 each sequence, where l is the length of the motif. We assessed the significance of each motif 19 instance by calculating the p-value of enrichment (Fisher's Exact Test) among 5UItranscripts 20 considering all transcripts with a motif instance achieving a PSSM score greater than equal to 21 the instance under consideration. The significance score and position of the two best motif 22 instances were used as features for the classifier (Table S1). 23 . CC-BY-ND 4.0 International license under a not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available The copyright holder for this preprint (which was this version posted September 6, 2016. ; https://doi.org/10.1101/057455 doi: bioRxiv preprint 5IM Classifier Training and Performance Evaluation 1 We modified an implementation of the Random Forest classifier (Breiman 2001) to 2 model the relationship between sequence features in the early coding region and the absence 3 of 5'UTR introns (5UIs). This classifier discriminates transcripts with 5'proximal-intron-4 minus-like-coding regions and hence is named the '5IM' classifier. The training set for the 5 classifier was composed of SSCR transcripts exclusively. There were two reasons to restrict 6 model construction to SSCR transcripts: 1) we expected the presence of specific RNA elements 7 as a function of 5UI presence based on our previous work (Cenik et al. 2011); and 2) we 8 wanted to restrict model building to sequences that have similar functional constraints at the 9 protein level. 10 We observed that 5UItranscripts with introns proximal to 5' end of the coding region 11 have sequence characteristics similar to 5UI + transcripts (Fig 1D). To systematically 12 characterize this relationship, we built different classifiers using training sets that excluded 13 5UItranscripts with a coding region intron positioned at increasing distances from the start 14 codon. We evaluated the performance of each classifier using 10-fold cross validation. 15 Given that a large number of motif discovery iterations were needed, we sought to 16 reduce the computational burden. We isolated a subset of the training examples to be used 17 exclusively for motif finding. Motif discovery was performed once using this set of sequences, 18 and the same motifs were used in each fold of the cross validation for all the classifiers. 19 Imbalances between the sizes of positive and negative training examples can lead to 20 detrimental classification performance (Wang and Yao 2012). Hence, we balanced the training 21 set size of 5UIand 5UI + transcripts by randomly sampling from the larger class. We 22 constructed 10 sub-classifiers to reduce sampling bias, and for each test example, the 23 . CC-BY-ND 4.0 International license under a not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available The copyright holder for this preprint (which was this version posted September 6, 2016. ; https://doi.org/10.1101/057455 doi: bioRxiv preprint prediction score from each subclassifier was summed to produce a combined score between 0 1 and 10. For the rest of the analyses, we used the classifier trained using 5UItranscripts where 2 first coding intron falls outside the first 90 coding nucleotides (Fig 1E). 3 We evaluated classifier performance using a 10-fold cross validation strategy for SSCR-4 containing transcripts (i.e. the training set). In each fold of the cross-validation, the model was 5 trained without any information from the held-out examples, including motif discovery. For all 6 the other transcripts and the control sets (see above), the 5IMP scores were calculated using 7 the classifier trained using SSCR transcripts as described above. 5IMP score distribution for 8 the control set was used to calculate the empirical cumulative null distribution. Using this 9 function, we determined the p-value corresponding to the 5IMP score for all transcripts. We 10 corrected for multiple hypotheses testing using the qvalue R package (Storey 2003). Based on 11 this analysis, we estimate that a 5IMP score of 7.41 corresponds to a 5% False Discovery Rate 12 and suggest that 21% of all human transcripts can be considered as 5IM transcripts. 13 While the theoretical range of 5IMP scores is 0-10, the highest observed 5IMP is 9.855. 14 We note that for all figures that depict 5IMP score distributions, we displayed the entire 15 theoretical range of 5IMP scores (0-10). 16

Functional Characterization of 5IM Transcripts: 17
We collected genome-scale dataset from publically available databases and from 18 supplementary information provided in selected articles. For all analyzed datasets, we first 19 converted all gene/transcript identifiers (IDs) into RefSeq transcript IDs using the Synergizer 20 webserver (Berriz and Roth 2008). If a dataset was generated using a non-human species (ex. 21 Targets identified by TDP-43 RNA immunoprecipitation in rat neuronal cells), we used 22 Homologene release 64 (downloaded on Sep 28 2009) to identify the corresponding ortholog 23 . CC-BY-ND 4.0 International license under a not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available The copyright holder for this preprint (which was this version posted September 6, 2016. ; https://doi.org/10.1101/057455 doi: bioRxiv preprint in humans. If at least one member of a transcript cluster was associated with a functional 1 phenotype, we assigned the cluster to the positive set with respect to the functional phenotype. 2 If more than one member of a cluster had the functional phenotype, we only retained one copy 3 of the cluster unless they differed in a quantitative measurement. For example, consider two 4 hypothetical transcripts: NM_1 and NM_2 that were clustered together and have a 5IMP score 5 of 8.5. If NM_1 had an mRNA half-life of 2hr while NM_2's half-life was 1hr than we split the 6 cluster while preserving the 5IMP score for both NM_1 and NM_2. 7 Once the transcripts were partitioned based on the functional phenotype, we ran two 8 statistical tests: 1) Fisher's Exact Test for enrichment of 5IM transcripts within the functional 9 category; 2) Wilcoxon Rank Sum Test to compare 5IMP scores between transcripts partitioned 10 by the functional phenotype. Additionally, for datasets where a quantitative measurement was 11 available (ex. mRNA half-life), we calculated the Spearman rank correlation between 5IMP 12 scores and the quantitative variable. In these analyses, we assumed that the test space was the 13 entire set of RefSeq transcripts. For all phenotypes where we observed a preliminary 14 statistically significant result, we followed up with more detailed analyses described below. 15

Analysis of Features Associated with Translation: 16
For each transcript, we predicted the propensity for secondary structure preceding the 17 translation start site and the 5'cap. Specifically, we extracted 35 nucleotides preceding the 18 translation start site or the first 35 nucleotides of the 5'UTR. If a 5'UTR is shorter than 35 19 nucleotides, the transcript was removed from the analysis. hybrid-ss-min utility (UNAFold 20 package version 3.8) with default parameters was used to calculate the minimum folding 21 energy (Markham and Zuker 2008). 22 . CC-BY-ND 4.0 International license under a not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available The copyright holder for this preprint (which was this version posted September 6, 2016. ; https://doi.org/10.1101/057455 doi: bioRxiv preprint Codon optimality was measured using the tRNA Adaptation Index (tAI), which is based 1 on the genomic copy number of each tRNA (dos Reis et al. 2004). tAI for all human codons 2 were downloaded from Tuller et al. 2010 Table S1. tAI profiles for the first 30 amino acids  3 were calculated for all transcripts. Codon optimality profiles were summarized for the first 30 4 amino acids for each transcript or by averaging tAI at each codon. 5 We carried out two control experiments to test whether the association between 5IMP 6 score and tAI could be explained by confounding variables. First, we permuted the nucleotides 7 in the first 90 nts and observed no relationship between 5IMP score and mean tAI when these 8 permuted sequences were used (Fig S1). Second, we selected random in-frame 99 nucleotides 9 from 3 rd exon to the end of the coding region and found no significant differences in tAI (Fig  10   S1). These results suggest that the relationship between tAI and 5IMP score is confined to the 11 first 30 amino acids and is not explained by simple differences in nucleotide composition. 12 Ribosome profiling and RNA expression data for human lymphoblastoid cells (LCLs) 13 were downloaded from NCBI GEO database accession number: GSE65912. Translation 14 efficiency was calculated as previously described (Cenik et al. 2015). Median translation 15 efficiency across the different cell-types was used for each transcript. 16

Analysis of Proximity Specific Ribosome Profiling Data: 17
We downloaded proximity specific ribosome profiling data for HEK 293 cells from Jan 18 et al. 2014 ; Table S6. We converted UCSC gene identifiers to HGNC symbols using g:Profiler 19 (Reimand et al. 2011). We retained all genes with an RPKM >5 in either input or pulldown and 20 required that at least 30 reads were mapped in either of the two libraries. We used ER-21 targeting evidence categories "secretome", "phobius", "TMHMM", "SignalP", "signalSequence", 22 "signalAnchor" from (Jan et al. 2014) to annotate genes as having ER-targeting evidence. The 23 . CC-BY-ND 4.0 International license under a not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available The copyright holder for this preprint (which was this version posted September 6, 2016. ; https://doi.org/10.1101/057455 doi: bioRxiv preprint genes that did not have any ER-targeting evidence or "mitoCarta" /"mito.GO" annotations were 1 deemed as the set of genes with no ER-targeting or mitochondrial evidence. We calculated the 2 log2 of the ratio between ER-proximal ribosome pulldown RPKM and control RPKM as the 3 measure of enrichment for ER-proximal ribosome occupancy (as in Jan et al. 2014). A moving 4 average of this ratio was calculated for genes grouped by their 5IMP score. For this calculation, 5 we used bins of 30 mitochondrial genes or 100 genes with no evidence of ER-or mitochondrial 6 targeting. 7

Analysis of Genome-wide Binding Sites of Exon Junction Complex: 8
Dr. Gene Yeo and Gabriel Pratt generously shared uniformly processed peak calls for 9 experiments identifying human RNA binding protein targets. These datasets include various 10 CLIP-Seq datasets and its variants such as iCLIP. A total 49 datasets from 22 factors were 11 analyzed. These factors were: hnRNPA1, hnRNPF, hnRNPM, hnRNPU, Ago2, hnRNPU, HuR, 12 IGF2BP1, IGF2BP2, IGF2BP3, FMR1, eIF4AIII, PTB, IGF2BP1, Ago3, Ago4, MOV10, Fip1, CF 13 Im68, CF Im59, CF Im25, and hnRNPA2B1. We extracted the 5IMP scores for all targets of each 14 RBP. We calculated the Wilcoxon Rank Sum test statistic comparing the 5IMP score 15 distribution of the targets of each RBP to all other transcripts with 5IMP scores. None of the 16 tested RBP target sets had an adjusted p-value < 0.05 and a median difference in 5IMP score > 17 1 when compared to non-target transcripts. 18 In addition, we used RNA:protein immunoprecipitation in tandem (RIPiT) data to 19 determine Exon Junction Complex (EJC) binding sites (Singh et al. 2014(Singh et al. , 2012. We analyzed 20 the common peaks from the Y14-Magoh immunoprecipitation configuration (Singh et al. 2012;21 Kucukural et al. 2013). Canonical EJC binding sites were defined as peaks whose weighted 22 center were 15 to 32 nucleotides upstream of an exon-intron boundary. All remaining peaks 23 . CC-BY-ND 4.0 International license under a not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available The copyright holder for this preprint (which was this version posted September 6, 2016. ; https://doi.org/10.1101/057455 doi: bioRxiv preprint were deemed as "non-canonical" EJC binding sites. We extracted all non-canonical peaks that 1 overlapped the first 99 nucleotides of the coding region and restricted our analysis to 2 transcripts that had an RPKM greater than one in the matched RNA-Seq data. (blue) were enriched for higher 5IMP scores. Light green shading indicates 5IMP scores 4 corresponding to 5% FDR. (d) Transcripts with non-AUG start codons (blue) exhibited 5 significantly higher 5IMP scores than transcripts with a canonical ATG start codon (yellow). 6 (e) Higher 5IMP scores were associated with less optimal codons (as measured by the 7 tRNA adaptation index, tAI) for the first 33 codons. For all transcripts within each 5IMP 8 score category (blue-high, orange-low), the mean tAI was calculated at each codon position. 9 Start codon was not shown. (f) Transcripts with lower translation efficiency were enriched 10 for higher 5IMP scores. Transcripts with translation efficiency one standard deviation 11 below the mean ("LOW" translation, yellow) and one standard deviation higher than the 12 mean ("HIGH" translation, blue) were identified using ribosome profiling and RNA-Seq data 13 from human lymphoblastoid cell lines (Methods). was calculated by grouping genes by 5IMP score (see Methods). We plotted the moving 18 average of 5IMP scores for transcripts with no evidence of ER-or mitochondrial targeting 19 (green) or for transcripts predicted to be mitochondrial (purple). We plotted a random 20 subsample of transcripts on top of the moving average (circles). 21

22
. CC-BY-ND 4.0 International license under a not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available The copyright holder for this preprint (which was this version posted September 6, 2016. ; https://doi.org/10.1101/057455 doi: bioRxiv preprint to have less optimal codons in their first 30 amino acids as measured by tRNA adaptation index 21 (tAI). The median tAI for each transcript was calculated and transcripts were grouped by their 22 . CC-BY-ND 4.0 International license under a not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available The copyright holder for this preprint (which was this version posted September 6, 2016. ; https://doi.org/10.1101/057455 doi: bioRxiv preprint 5IMP scores. The distribution of median tAI was plotted as a boxplot. (b) We permuted the 1 nucleotides of the first 99 nucleotides and found that the relationship between 5IMP and tAI 2 was lost. This result suggested that nucleotide composition alone doesn't explain the 3 relationship between 5IMP and tAI. (c) We used the previously described negative control 4 sequences, each derived from a single randomly chosen in-frame 'window' downstream of the 5 3rd exon from one of the evaluated transcripts. We found no relationship between 5IMP score 6 and tAI in these regions suggesting that the observed association is restricted to the early 7 coding region. 8 Table S1-| List of features used by the 5IM classifier. 9 Table S2-| 5IMP scores of all human transcripts. 10 Table S3-| List of functional features tested for association with 5IMP scores. 11 . CC-BY-ND 4.0 International license under a not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available  . CC-BY-ND 4.0 International license under a not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available The copyright holder for this preprint (which was this version posted September 6, 2016. ; https://doi.org/10.1101/057455 doi: bioRxiv preprint All 5IM 0% 40% 80% SignalP-SignalP+ SSCR MSCR . CC-BY-ND 4.0 International license under a not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available The copyright holder for this preprint (which was this version posted September 6, 2016. ; https://doi.org/10.1101/057455 doi: bioRxiv preprint  . CC-BY-ND 4.0 International license under a not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available The copyright holder for this preprint (which was this version posted September 6, 2016. ; https://doi.org/10.1101/057455 doi: bioRxiv preprint  . CC-BY-ND 4.0 International license under a not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available The copyright holder for this preprint (which was this version posted September 6, 2016. ; https://doi.org/10.1101/057455 doi: bioRxiv preprint . CC-BY-ND 4.0 International license under a not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available The copyright holder for this preprint (which was this version posted September 6, 2016. ; https://doi.org/10.1101/057455 doi: bioRxiv preprint . CC-BY-ND 4.0 International license under a not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available . CC-BY-ND 4.0 International license under a not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available The copyright holder for this preprint (which was this version posted September 6, 2016. ; https://doi.org/10.1101/057455 doi: bioRxiv preprint