Shortening of 3′ UTRs in most cell types composing tumor tissues implicates alternative polyadenylation in protein metabolism

During pre-mRNA maturation 3′ end processing can occur at different polyadenylation sites in the 3′ untranslated region (3′ UTR) to give rise to transcript isoforms that differ in the length of their 3′ UTRs. Longer 3′ UTRs contain additional cis-regulatory elements that impact the fate of the transcript and/or of the resulting protein. Extensive alternative polyadenylation (APA) has been observed in cancers, but the mechanisms and roles remain elusive. In particular, it is unclear whether the APA occurs in the malignant cells or in other cell types that infiltrate the tumor. To resolve this, we developed a computational method, called SCUREL, that quantifies changes in 3′ UTR length between groups of cells, including cells of the same type originating from tumor and control tissue. We used this method to study APA in human lung adenocarcinoma (LUAD). SCUREL relies solely on annotated 3′ UTRs and on control systems such as T cell activation, and spermatogenesis gives qualitatively similar results at much greater sensitivity compared to the previously published scAPA method. In the LUAD samples, we find a general trend toward 3′ UTR shortening not only in cancer cells compared to the cell type of origin, but also when comparing other cell types from the tumor vs. the control tissue environment. However, we also find high variability in the individual targets between patients. The findings help in understanding the extent and impact of APA in LUAD, which may support improvements in diagnosis and treatment.


INTRODUCTION
The processing of most human pre-mRNAs involves 3 ′ end cleavage and addition of a polyadenosine [poly(A)] tail. Typically, there are multiple cleavage and polyadenylation sites within a gene, and alternative polyadenylation (APA) has emerged as a major source of transcriptome diversity (Reyes and Huber 2018). A prevalent type of APA isoforms are those that differ only in the length of their 3 ′ untranslated regions (3 ′ UTRs). 3 ′ UTRs become shorter upon T cell activation (Sandberg et al. 2008;Gruber et al. 2014), in cancer cells (Mayr and Bartel 2009;Xia et al. 2014) and upon induction of reprogramming in somatic cells (Ji and Tian 2009). Although the responsible regulators are still to be determined, core 3 ′ end processing factors under the transcriptional control of cell cycle-related transcription factors have been implicated, at least in the context of cell proliferation (Elkon et al. 2012). Various RNA-binding proteins (RBPs) are also involved in specific cellular sys-tems (Martin et al. 2012;Gruber et al. 2018b;So et al. 2019;Masuda et al. 2020;Lee et al. 2021).
While APA-dependent 3 ′ UTR shortening has been observed in many cancers (Xia et al. 2014;Schmidt et al. 2018), it is presently unclear whether it is a manifestation of the change in cell composition of the tissue or of functional changes in all cell types within the tumor environment. As single cell RNA sequencing (scRNA-seq) technologies specifically capture mRNA 3 ′ ends, and data sets of tumor and matched control tissue samples have started to become available, this question can now be addressed, provided a few challenges are overcome. First, the number of transcripts that can be reliably quantified is still low (Breda et al. 2021), because the total number of reads obtained from individual cells is in the 10 3 -10 4 range. Thus, quantifying gene expression at the isoform level is still very challenging. This issue can be partially circumvented by pooling the reads from cells of the same type. Second, while 3 ′ biased, scRNA-seq reads do not always reach the polyadenylation site (PAS) and may also result from internal priming. Thus, identifying which reads correspond to the same 3 ′ end is also not trivial. This problem can be mitigated by associating scRNA-seq reads with already-annotated transcript 3 ′ ends. However, the current annotation is still far from complete (Gruber et al. 2018a), leading to PAS usage quantification that is imprecise and incomplete. For this reason we developed a PAS-agnostic approach for quantifying changes in 3 ′ UTR length between samples, based on the entire 3 ′ end read distribution along the 3 ′ UTR. Applying the method to single cell sequencing data from human lung adenocarcinoma (LUAD), we found that 3 ′ UTR shortening is not specific to a cell type but rather occurs in most cell types that compose the tumor. Furthermore, our analysis revealed that the targeted transcripts encode proteins that are involved in various steps of protein metabolism, including synthesis at the endoplasmic reticulum (ER), transport between ER and the Golgi network and finally secretion of proteins. Our data thus implicates APA in the remodeling of protein metabolism in tumors.

A myeloid to lymphoid switch in lung tumors
While analyses of bulk RNA-seq data revealed the shortening of 3 ′ UTRs in virtually all studied cancers with respect to matched control tissue, the shortening is especially pronounced in lung tumors (Gruber et al. 2018b). Thus, to better understand the mechanism and function of APA in cancers, we identified two studies in which single cell sequencing of lung adenocarcinoma (LUAD) and matched control tissue from multiple patients was carried out on the same platform, 10x Genomics (Lambrechts et al. 2018;Laughney et al. 2020). These data enable us to not only identify 3 ′ UTR changes in specific cell types, but also to assess their generality between studies and patients. We followed the procedure described in Lambrechts et al. (2018) to annotate the type of individual cells. Briefly, we integrated the data with the harmony package (see Materials and Methods; Supplemental Fig. 1), clustered the normalized gene expression vectors of all cells ( Fig. 1A) with the Seurat package (Butler et al. 2018), and annotated the type of 38,156 cells from 12 samples of the Lambrechts et al. (2018) study (samples 3a-d, 4ad, 6a-d, representing three tumor samples and a matched control for each of three patients) and 18,543 cells of the Laughney et al. (2020) study (three pairs of tumor-matched control samples) based on known markers. We used the markers proposed in the Lambrechts et al. (2018) study, but also added a few markers for mast cell (TPSAB1, TPSB2, and CPA3; Table 1; Fig. 1B; Dwyer et al. 2016). As described in the initial study (Lambrechts et al. 2018), the most abundant cell types in the tumor samples were T cells, myeloid and B cells, while the matched control samples were dominated by myeloid and alveolar cells (Fig.  1C). We further identified a small cluster of mast cells, annotated as B cells in the initial study that did not consider mast cell markers. We observed a similar myeloid to T cell switch between control and cancer samples from the Laughney et al. (2020) study (Fig. 1D). In addition, the matched control samples from this latter study had a more homogenous cell-type composition compared to those from the Lambrechts et al. (2018) study, consisting almost exclusively of lymphocytes and myeloid cells (Fig. 1D).
Given that T cells are the most numerous cell type in tumor samples and that T cell activation leads to 3 ′ UTR shortening (Sandberg et al. 2008;Gruber et al. 2014), we wondered whether the pattern of 3 ′ UTR usage that was previously inferred from "bulk" samples can be attributed to the infiltration of the tumor with activated T cells. To investigate this possibility, we first determined the distribution of RNA molecules (unique molecular identifiers, UMI) per cell in various cell types in the two studies (Supplemental Fig. 2A) and the total number of UMIs obtained from each cell type in each data set (Supplemental Fig. 2B). While T cells were the most numerous cell type in tumors, their relatively small RNA content per cell led to a smaller overall contribution to the total RNA pool compared to the less numerous myeloid cells, which have substantially more RNA molecules per cell (Supplemental Fig.  2A). Thus, the "bulk" RNA obtained from tumor samples is not dominated by RNA originating from T cells, suggesting that other cell types also contribute to the 3 ′ UTR shortening that was previously described in tumors. We therefore carried out a cell-type-specific analysis of 3 ′ UTR usage in tumors relative to matched controls.
A PAS-agnostic approach to quantify 3 ′ ′ ′ ′ ′ UTR shortening and APA events A few approaches have been proposed for assessing APA in scRNA-seq data sets (Shulman and Elkon 2019;Patrick et al. 2020;Wu et al. 2020). However, their robustness with respect to the sparsity of the data and the incompleteness of PAS annotation has not been checked (Ye et al. 2020). Thus, we developed a novel approach (single cell analysis of 3 ′ untranslated region lengths, SCUREL) ( Fig.  2A), specifically designed to circumvent these issues and implemented in a Snakemake (Koster and Rahmann 2012) workflow. SCUREL enables two different comparisons of 3 ′ UTR length: between two different cell types in a data set ("cell type" mode), or for the same cell type between two different conditions (e.g., tumor and matched control tissue, "condition" mode). We frame the detection of changes in 3 ′ UTR length between two groups of cells as a problem of identifying the cell group from which the reads originated by inspecting the positions where the reads map in the terminal exons (TEs). That is, read 3 ′ ends are tabulated and the cumulative coverage along individual TEs is calculated and normalized (Fig. 2B). Then, analyzing each TE individually, we record the fraction of reads from the two cell groups that map within an extending window of the TE starting from the 3 ′ end (Fig. 2C). This yields a curve in the plane defined by the proportions of reads in the two cell groups, which is similar to a receiver operating characteristic (ROC). The area under this curve (AUC) indicates the similarity of TE length between the compared cell groups. The curve is anchored at coordinates (0,0), corresponding to the end of the TE, where no reads have been observed yet, and (1,1), corresponding to the start of the TE, where all reads from the TE have been accounted for. If the coverage of a TE by read 3 ′ ends were similar between the two groups of cells and thus the cell group cannot be identified from the position of the reads, the curve would trace the diagonal line. Deviations above the diagonal indicate higher coverage of the distal region of the TE in the cell group represented on the y-axis, while deviations below the diagonal line indicate higher coverage of the distal TE region in the cell group represented on the x-axis. When the number of reads mapping to a given TE is small, the curve will show discrete jumps of 1/n step size (where n is the number of reads mapping to the TE), as individual reads are encountered along the TE. This could lead to AUC values that deviate strongly from the 0.5 value expected under the assumption of similar coverage in the two cell groups. To avoid false positives that are caused by these finite sampling effects, we constructed a background coverage data set by randomizing the labels indicating the cell group from which each read originated. This preserves the depth of coverage of each  Table 1). The dot plot was created with Seurat. (C) Two-dimensional projection (created with Seurat) of gene expression vectors as in A, but highlighting only cells from one study in each panel. (D) Box plot of relative proportion of each cell type in control (green) and tumor (red) samples from individual patients from the Lambrechts and Laughney data sets. TE in each group of cells while randomizing the location of each read, thus allowing us to determine changes in 3 ′ UTR length that cannot be explained by the sparsity of the data. For considerations of efficiency, we carried out the randomization once, and used the information from TEs with similar average coverage to detect significant AUC values. That is, the distribution of AUC values being wider for TEs with low coverage (in counts per million, CPM) compared to TEs with high coverage (Fig. 2D), we binned TEs by the average coverage in the two cell groups [in log(mean CPM)] and within each of the 20 bins, we used the 1% quantile of the randomized read data as the threshold for significant AUC values. Finally, noting that in some cases the difference in the TE exon was small and unlikely to be due to APA, we selected only those TEs for which the read 3 ′ ends span a sufficiently large distance. That is, we calculated the interquartile range (IQR) of read 3 ′ end positions and, if the union of these intervals for the two cell clusters that were analyzed was larger than 200 nt, we considered the range of 3 ′ end variation sufficient to be indicative of APA (Fig. 2D).

SCUREL detects 3 ′ ′ ′ ′ ′ UTR length changes in previously characterized systems
To validate our approach, we analyzed the dynamics of 3 ′ UTR length in two well-characterized cellular systems, namely T cell activation, where 3 ′ UTRs become shorter, and sperm cell development, where the 3 ′ UTRs are known to become longer. Furthermore, we compared our results with those generated on these data sets by the previously published scAPA method (Shulman and Elkon 2019).
We annotated the mouse T cell scRNA-seq data (Pace et al. 2018) with Seurat, obtaining 1605 activated and 1535 naïve T cells (Fig. 3A), with 5.8 and 1.8 million reads mapped to TEs, respectively. Applying SCUREL, we identified 261 TEs whose length changed significantly upon T cell activation, of which 218 (84%) became shorter (Fig.  3B). These results recapitulate those obtained from bulk RNA sequencing in a similar system (Gruber et al. 2014). Applying the previously published scAPA method (Shulman and Elkon 2019) (see Materials and Methods) we only obtained 14 TEs with a significant length change, 12 of which (85%) became shorter (Fig. 3C). Two-thirds of the scAPA-identified targets (eight of 12 TEs) were also identified by our method, while the four cases missed by SCUREL involved either very small TE length changes (three cases) or a difference in the annotation of the TE, because scAPA also quantifies PAS downstream from annotated TEs. In contrast, inspection of nine randomly chosen TEs identified only by SCUREL indicated that they correspond to genes with relatively low expression, which are overlooked by scAPA (Supplemental Fig. 3). Examples of TEs from each of these categories are shown in Figure 3G.
We carried out a similar analysis on a mouse spermatogenesis data set (Lukassen et al. 2018), as it is well known that 3 ′ UTRs become progressively shorter during maturation of germ cells (spermatogonia) to spermatocytes, spermatids and finally spermatozoa. We used the markers described in the original publication (Lukassen et al. 2018) to annotate 386 elongating spermatids (ES) and 667 spermatocytes (SC), with 8 and 12 million reads in the TE regions, respectively (Fig. 3D). Applying SCUREL, we found 2060 TEs whose length changed significantly from SCs to ES, almost all of which (1992, 97%) became shorter (Fig. 3E). scAPA yielded a similar proportion of shortened TEs (but fewer in absolute number), 96% (165 of 171 significant APA events, Fig. 3F). As in the case of T cells, most of the scAPA-identified TEs were also found by our method (146 of 165 TEs), while TE annotation and small changes in PAS usage accounted for the cases that were unique to scAPA. Inspection of nine randomly chosen TEs identified only by SCUREL indicated that they correspond to genes with relatively low expression or exclusively express one PAS or the other (Supplemental Fig. 4).

Genes involved in protein metabolism are targets of 3 ′ ′ ′ ′ ′ UTR shortening in lung cancer cells
Having established that our method reproduces previously reported patterns of 3 ′ UTR length change in physiological settings, we then turned to the question of whether 3 ′ UTRs are also different in lung cancer cells compared to their nonmalignant counterpart, the alveolar epithelial cells. We identified 1330 TEs that were shorter in the 3607 cancer compared to the 851 alveolar cells in the Lambrechts data Based on Lambrechts et al. (2018).
set (with 22 and 3.7 million reads in TEs, respectively), representing 98% of 1357 significant events (Fig. 4A, top). Similarly, we identified 188 shortened TEs from the Laughney data set of 489 cancer and 292 alveolar cells (with 6 and 1.3 million reads in TEs, respectively), representing 85% of 219 significant events (Fig. 4A, bottom). While much fewer events were found in the Laughney data set, the majority (105 of 188 TEs, 56%) were shared with the Lambrechts data set. To determine whether specific biological processes are subject to APA-dependent regulation in cancer cells, we submitted the set of 105 shared genes to functional analysis via the STRING web server (Szklarczyk et al. 2019). This revealed that the corresponding proteins are associated with membranes, vesicles and granules (Fig.  4B,C). Interestingly, these APA targets cover the entire lifecycle of membrane and secreted proteins, from synthesis (i.e., translation initiation factors and ribosomal proteins), to traffic into the ER (e.g., SSR1, SPCS3, SEC63) and Golgi (e.g., TRAPPC3, KDELR2), to proteasome-mediated degradation (PSMD12). Some of the APA targets are surface receptors with well-known involvement in cancers (CD44, CD47, and CD59). These results indicate that APA contributes to the orchestration of protein metabolism and traffic in cancer cells. Examples of TEs from Figure 4B are shown in Figure 4D.

Conserved targets of 3 ′ ′ ′ ′ ′ UTR shortening in individual cell types
The next question we wanted to answer is whether 3 ′ UTR shortening affects all cells in the tumor environment, or it is rather restricted to specific cell types. We thus carried out the SCUREL analysis for each individual cell type for which we had at least ∼20 cells in each data set, comparing TE lengths between cells of the same type, from the tumor sample and matched control sample. We found many more TEs becoming significantly shorter than longer (Fig. 5A,B), across almost all cell types and in both data sets. This is summarized in Figure 5C, which shows that the proportion of shortened among significantly changed TEs is almost always greater than 0.5. By grouping all reads from the tumors and from matched control samples, respectively, we also recapitulated the result of previous "bulk" RNA-seq data analyses (Fig. 5D). Thus, 3 ′ UTR shortening is not restricted to a specific cell type, but seems to generally take place in all cell types, associated with the tumor environment. Moreover, in spite of the differences between the studies, there was a highly significant overlap between the targets of TE shortening in individual cell types (Fig. 5E,F). To gain further insight into the processes that may be regulated by APA, we submitted the intersection sets of genes exhibiting TE shortening in T lymphocytes and myeloid cells in these studies to functional enrichment analysis. We found significant enrichments especially in cellular components such as membranes, vesicles and granules (Fig.  5G,H), similar to what we observed in cancer cells.

Variability in 3 ′ ′ ′ ′ ′ UTR shortening among individuals
Finally, we asked to what extent are the targets of 3 ′ UTR shortening similar across patients. To answer this question, we analyzed individually the cells obtained from three patients in the Lambrechts study. Interestingly, in spite of the similar histopathological classification of the samples, one of the three samples was markedly different from the others, not exhibiting any tendency toward 3 ′ UTR shortening ( Fig. 6A-D). The other two samples showed highly significant overlaps between shortened 3 ′ UTRs in different cell types (Fig. 6E). Analysis of biological process enrichment in individual cell types based on the genes targeted in both of these patients reinforced the concept that transport processes are affected in multiple cell types (Fig. 6F). It also provided further granularity. For example, leukocyte activation and secretion are terms enriched in the myeloid cell data, whereas metabolic processes are enriched in T cells, interaction with immune cells in endothelial cells and interaction with endothelial cells and angiogenesis in fibroblasts. Altogether these data demonstrate the power of SCUREL identifying changes in APArelated changes in 3 ′ UTR length, revealing common functional themes, despite substantial variability between samples. A complete table of genes with significant 3 ′ UTR shortening across all LUAD comparisons we conducted is available in Supplemental Table 1.
The data further indicate that protein transport processes and intercellular communication are preferential targets of APA across multiple cell types.

DISCUSSION
The remodeling of gene expression in cancers involves, among other processes, alternative polyadenylation. A tendency toward 3 ′ UTR shortening has been generally observed, though to different extents, in virtually all studied cancers (Xia et al. 2014;Schmidt et al. 2018). Whether this is the result of changes in the cell-type composition of the tissue or to cancer-related changes in functionality in all cell types has not been investigated so far. We set out to answer this question, taking advantage of single cell sequencing data sets obtained from human lung adenocarcinoma. As the sparsity of the scRNA-seq data poses some challenges (Lähnemann et al. 2020) we sought two distinct studies that used the same sequencing platform, to identify shared patterns of variation. Furthermore, we developed an approach that controls for both imperfect annotation of transcript isoforms and low read coverage in scRNA-seq.
Comparing data from cells of the same type, but originating either from tumor samples or from matched control tissue, we found similar tendencies toward 3 ′ UTR shortening in the tumor environment for most cell types. Furthermore, the proteins encoded by the transcripts that are affected in various cell types cluster into specific functional classes, specifically the synthesis, traffic, secretion and degradation of proteins. This implicates APA in the regulation of protein metabolism and the organization of subcellular structure.
Initial studies that described the phenomenon of 3 ′ UTR shortening in T cells and cancer cells proposed a role in the regulation of protein levels, as short 3 ′ UTR isoforms are more stable than those with long 3 ′ UTRs (Sandberg et al. 2008;Mayr and Bartel 2009). However, when the decay rates of 3 ′ UTR isoforms were measured, they turned out to be rather similar (Spies et al. 2013;Gruber et al. 2014), leaving open the question of functional differences between 3 ′ UTR isoforms (Mayr 2018). More recent work uncovered additional layers of 3 ′ UTR-mediated regulation. For example, a role of 3 ′ UTRs in the localization of the translated protein (UDPL) has been described for a number of membrane proteins, including the immunoglobulin family member CD47, whose localization to the cell membrane protects host cells from phagocytosis by macrophages (Berkovits and Mayr 2015). Interestingly, CD47 is a conserved APA target in both LUAD data sets that we analyzed here, its 3 ′ UTR becoming shorter in The interaction network (from the STRING web server) of proteins whose transcripts undergo 3 ′ UTR shortening in both data sets. (C) Functional enrichment analysis for genes whose TEs undergo shortening in cancer cells. Shown are the top 10 GO biological process terms (sorted by the false discovery rate, FDR). Analysis was performed with STRING web server, using as background the set of genes found to be expressed in the lung samples. (D) Read coverage along TEs for a few example genes from panel B (EIF1, CD44, and CD59). Each panel shows four tracks per data set, blue: cancer cells, red: alveolar cells, coverage of the TE by reads (top track) and the cumulative coverage of the TE by read 3 ′ ends (bottom track). In all cases, the 3 ′ UTRs are shorter in cancer compared to alveolar cells.
cancer cells compared to lung alveolar cancer cells. This would predict decreased localization of CD47 to the surface of cancer cells, making them more susceptible to apoptosis compared to normal alveolar cells. This may explain why increased levels of CD47 are associated with increased cancer-free survival of patients with lung cancers (kmplot.com [Nagy et al. 2021]). It will be very interesting to apply methods for simultaneous profiling of protein and mRNA expression in single cells (Stoeckius et al. 2017) to better understand the interplay between APA, gene expression, and membrane localization of CD47 in cancers.
The concept that 3 ′ UTR shortening is associated with proliferative states was challenged in a recent study that instead demonstrated its association with the secretion of proteins, both in trophoblast and in plasma cells (Cheng et al. 2020). Our data fully support this notion, extending the data to cancer cells as well as T lymphocytes and myeloid cells. As the protein production apparatus is present in all cells, APA is a well-suited mechanism for fine-tuning the expression of various components in a celltype-and cell-state-dependent manner (Lianoglou et al. 2013). Associating APA with protein metabolism rather than cell proliferation makes the question of its upstream regulation ever more puzzling because the shortening of 3 ′ UTRs in proliferating cells has been attributed to an increased expression of 3 ′ end processing factors mediated by cell cycleassociated E2F transcription factors (Elkon et al. 2012). It will be interesting to revisit this issue in a system where the increased protein production and secretion can be decoupled from cell proliferation, as the B cell maturation system (Cheng et al. 2020).
In conclusion, among the many applications of scRNAseq, analysis of cell-type-dependent polyadenylation reveals the relevance of APA as a general mechanism for regulating the metabolism and traffic of proteins within cells. With SCUREL we provide a robust method for detecting changes in 3 ′ UTR length for even low-expression genes between cell types, in a manner that does not rely on accurate PAS annotation.

Lung cancer samples
Lung adenocarcinoma (LUAD) and matched control samples were downloaded from the GEO database (Barrett et al. 2013), based on the accession numbers in the original publications. Specifically, from the Lambrechts et al. (2018) data set we used the LUAD samples listed in Table 1 2020) study we also used LUAD and matched control samples, which originated from three donors. These samples were also generated with the 10x Genomics Single Cell 3 ′ V2 protocol (accession number GSE123904).

Mouse testis samples
scRNA-seq data from the testes of two 8-wk-old C57BL/6J mice (Lukassen et al. 2018) were downloaded from the GEO database (accession number GSE104556).

Mouse T cell samples
scRNA-seq data of FACS sorted T cells from the lymph nodes and spleen of C57BL/6J mice, three infected with OVAexpressing Lysteria monocytogenes and one naive (Pace et al. 2018) were downloaded from the GEO database (accession number GSE106268).
Execution of scAPA scAPA (Shulman and Elkon 2019) was downloaded from the github repository and executed with the same genome sequence that was used throughout the study. For compatibility, the "chr" prefix in the chromosome names was removed. The lengths of the chromosomes were obtained with samtools faidx. The homer software (v4.11.1) required by the scAPA package was manually downloaded from http://homer.ucsd.edu/homer/. We collected all other requirements specified on scAPA github page in a conda environment. The removal of duplicate reads was done by adjusting the existing umi_tools dedup command in scAPA.shell.script.R for 10x Genomics, using the following options " -per-gene," " -gene-tag=GX," "per-cell." This was necessary because according to the protocol, one RNA fragment could result in reads that do not map at identical positions.

Extraction of terminal exons
Terminal exons were obtained from the RefSeq genome annotations (gff), GRCm38.p6 for mouse and GRCh38.p13 for human, with a custom script, as follows. Chromosome names from the RefSeq assembly were converted to ENSEMBL-type names based on the accompanying "assembly_report.txt" file. Only au-tosomes, allosomes and mitochondrial DNA were retained. Based on the genome annotation file, protein-coding and long noncoding transcripts were retained, while model transcripts ("Gnomon" prediction; accession prefixes XM_, XR_, and XP_) were discarded. From this transcript set, the 3 ′ -most exons (i.e., terminal exons, TEs) were retrieved. Overlapping TEs on the same chromosome strand were clustered with intervaltree (v3.0.2; python package) and from each cluster, the longest exon was kept. The resulting set of TEs was sorted by chromosome and start position and saved to a BED-formatted file. TE IDs were converted to gene names with biomaRt (v 2.46.3) using the ensembl BioMart database. Duplicate gene names were discarded.

Processing of scRNA-seq reads
The workflow can start from mapped reads in cellranger-compatible format, a file with cell barcode-to-cell-type annotation and a genome annotation file. Alternatively, the cellranger count function can be used to map reads from FASTQ input data. Reads from the FASTQ files were mapped with the function count from the cellranger (v5.0.0) package to the reference human genome GRCh38-3.0.0 sequence obtained directly from the 10x genomics website. This genome is a modified version of the GRCh38 genome, compatible with the cellranger analysis pipeline. Reads are also aligned to the transcriptome. In this step, cell barcodes and UMIs correction also takes place. Aligned reads (BAM) with mapping quality (MAPQ) scores >30 were selected with samtools (v1.12 [Li et al. 2009]). Reads without a cell barcode "CB" tag were removed with samtools view, as were duplicated reads using umi_tools dedup (v1.1.1 [Smith et al. 2017]). The mapped reads are filtered, deduplicated and grouped by cell type in the "cell type" mode or by cell type and tissue of origin in the "condition" mode. In the latter case, quasi-bulk samples are also constructed from the filtered reads that come from individual conditions.

Cell-type annotation
The annotation of cell types in all data sets was carried out with the approach described in Lambrechts et al. (2018). Filtered data (so as to remove artifacts such as empty droplets) consisting of cellular barcodes and count matrices from individual data sets were loaded in R (v4.0.3) with Read10X (from Seurat v3.2.3 [Butler et al. 2018]), and Seurat objects were created with CreateSeuratObject. For the lung cancer data sets, cells with <201 unique molecular identifiers (UMIs), with <101 or >6000 genes or with >10% UMIs from mitochondrial genes (which may indicate apoptotic or damaged cells) were removed. For all data sets, genes with zero variance across all cells (i.e., sum = 0) were discarded. The gene expression counts for each cell were log-normalized with NormalizeData with a default scale factor of 10,000. In Seurat, 2000-2500 most variable genes are used to cluster the cells. Here we used the 2192 variable most variable genes, as in Lambrechts et al. (2018). These were selected with FindVariableFeatures, with normalized expression between 0.125 and 3, and a quantile-normalized variance exceeding 0.5 for lung cancer and mouse T cell samples, and normalized expression between 0.1 and 8 for mouse testis samples. Gene expression levels were then centered and scaled across all cells. After principal component analysis (PCA) on the most variable genes, the number of relevant dimensions n for each data set was determined by assessing the variance explained by individual principal components (PC) with ElbowPlot from Seurat. UMAP (McInnes et al. 2018) was used to visualize the data projected on the n dimensions. For T cell activation and LUAD samples, batch correction and data integration were performed with harmony (v1.0) (Korsunsky et al. 2019). Harmony was run on the first 30 PCs and set to group by data set. The transformed data set was used for downstream analysis (i.e., clustering of cells, visualization in 2D). Various Seurat functions were used to identify the cell type of individual cells. Cells were clustered using the shared nearest neighbor (SNN) algorithm, which aims to optimize modularity. First, FindNeighbors was executed using the first n dimensions from PCA or harmony and with otherwise default settings (k = 20). Then, FindClusters with resolution parameter 0.6 for LUAD, 0.2 for T cells and 0.3 for spermatocytes was run, so as to retrieve a number of clusters similar to those in the original publications (Lambrechts et al. 2018;Lukassen et al. 2018;Pace et al. 2018;Laughney et al. 2020). The expression of cell-type markers in each cluster was assessed with FindAllMarkers. This function finds genes that are differentially expressed between cells from one cluster and all other cells, by applying a Wilcoxon rank sum test on the log-normalized expression. Individual clusters were downsampled to the number of cells in the smallest cluster or to at least 100 cells. Only genes expressed in a minimum of 10% of the cells in either population and with a log (base e) fold-change of at least 0.25 (default values in Seurat) were tested. Markers with adjusted P-value <0.01 were considered significant and those with higher expression in the selected cluster were considered as potential markers for that cell cluster. For each cluster we counted the number of significant markers that matched known cell-type markers (Table 1) and assigned the cell type to be the one for which a proportion of >0.6 of known markers were specifically expressed in the cell cluster. Generally, this assignment was unambiguous, and when it was not, the cell-type assignment was done manually, taking into account the adjusted P-value and average log-foldchange of all considered marker genes as well as the cell-type annotation from the Supplemental Table 3 of Lambrechts et al. (2018), which contains additional cell-type markers. At least three marker genes were required to assign a cluster to the corresponding cell type, except for cancer cells that were annotated only based on the expression of EPCAM.
Assessing 3 ′ ′ ′ ′ ′ UTR length differences with the AUC measure To assess changes in 3 ′ UTR length between groups of cells we used the following approach. For simplicity, the analysis is carried out for terminal exons (TEs) rather than 3 ′ UTRs, as 3 ′ UTRs are generally contained in TEs, covering almost the entire length of the TEs. We started from the BAM files of mapped reads from two groups of cells. We computed the 3 ′ end coverage of individual TEs per strand with bedtools genomecov and parameter "-bga." The BED file with read 3 ′ end positions was used to obtain the normalized reverse cumulative coverage of individual TEs, that is, starting at the TE 3 ′ end and ending at the 5 ′ most nucleotide. Individual TEs were traversed from the end to the beginning, recording the reverse cumulative coverage in the two groups of cells as a function of position. The area under the resulting curve (AUC) was then calculated. An AUC of 0.5 corresponds to identical position-dependent coverage of the TE by 3 ′ end reads in the two groups of cells, that is, no difference in TE length. An AUC value of 1 corresponds to all the 3 ′ end reads from the group of cells indicated on the y-axis being clustered at the end of the TE, before any reads from the other group are observed, thus the TEs are longest in this group of cells. Vice versa, an AUC value of 0 corresponds to all the 3 ′ end reads from the group indicated on the x-axis are observed before any reads of the other group, thus the 3 ′ UTRs are longest in this group of cells.
If the read coverage of a TE is very sparse, the curve representing the coverage in the two cell groups will not be smooth, but rather change in steps of 1/n where n is the number of reads mapping to the TE; deviations from the diagonal line of identical coverage in the two groups will be common, due to the stochastic sampling of the reads. To mitigate this effect and identify TEs whose coverage cannot be explained by stochastic sampling of low-expression genes, we generated a background data set, in which we randomized the cell group label of the reads. This procedure preserves the number of reads obtained in each TE in each group, but randomizes their position in the TE.
Finally, we identified TEs with AUCs indicating significant shifts in PAS usage. For this, we extracted TEs with a normalized read count (CPM) ≥ 2 in both cell groups, roughly corresponding to TEs with at least one count in each of the groups. As AUC values depend on the overall expression of the TE, we used an expression-dependent AUC cutoff to identify the TEs significantly changing length. This corresponded to the two-tailed 1% quantile of the background distribution in each of the 20 equal-sized log(mean expression between cell groups) bins, smoothened using the median over a running window of five values. Finally, to ensure that the change in read coverage was due to APA, we only retained significantly changed TEs for which the union of the interquartile range of TE positions that were covered by 3 ′ end reads in the two samples spanned at least 200 nt.

Analysis of overlaps between data sets
We used a sample-specific background for the calculation of the probability of overlap of genes and for the pathway enrichment analysis carried out on the STRING web server. All TEs considered in the AUC analysis, that is, TEs with CPM ≥ 2, in each sample were combined and the unique set of TEs was used as background. In particular, for the cell-type analysis of the Lambrechts data set, we used the cell-type-specific union of TEs from patients 3, 4, and 6 and obtained 10,966 genes for myeloid cells, 10,473 for T cells, 11,269 for endothelial cells, and 11,857 for fibroblasts. For the cell-type analysis of lung cancer data sets, the union of TEs consisted of 10,177 genes in T cells and 9970 genes in myeloid cells. We used the hypergeometric distribution to calculate the odds ratio and associated P-value of the overlap between gene sets.

Pathway analysis
The gene symbols for TEs with significant APA events were analyzed via the STRING web server, which provides enriched Gene ontology (GO) terms, KEGG and reactome pathways. As a background gene set for the enrichment analysis we provided the data set-specific list of expressed genes (CPM ≥ 2).

SUPPLEMENTAL MATERIAL
Supplemental material is available for this article.