We first sought to construct a score that captures the genome-wide intolerance to variation profile. We applied a tiled gwRVIS to WGS data from 62,784 individuals available in the TOPMed dataset13 (Freeze5 release). The original RVIS score1 quantifies intolerance by regressing the number of observed common functional (missense and protein-truncating) variants on the total number of observed variants in a gene. The resulting regression line predicts the expected number of common functional variants, and the deviation of each gene from this expectation (more or less variation than expected) is calculated as the residual divided by an estimate of its standard deviation (studentized residual).
In the genome-wide approach, we no longer have pre-defined genomic units (like genes in RVIS) or functional annotations for variants. Thus, we scan the entire genome with a sliding-window approach (using a 1-nucleotide step), recording the number of all variants and common variants, irrespective of their predicted effect, within each window, to eventually calculate a single-nucleotide resolution genome-wide intolerance score (Fig. 1).
Flowchart for calculation of single-nucleotide resolution gwRVIS from whole-genome sequencing population genetic data (e.g., TOPMed and gnomAD), including relevant filtering steps (regarding depth coverage, repeat sequences, and variant quality).
The gwRVIS method is based on two hyperparameters: the length of the window and the minor allele frequency (MAF) threshold over which we consider a variant as common. We have fine-tuned these parameters by testing a range of values (Supplementary Fig. 1) and identified low sensitivity around the exact value selection. Taking into consideration the largest segregation achieved between the most intolerant and tolerant genomic classes (see “Methods”) we selected a window length of 3 kb and an MAF threshold of 0.1%. As the number and ancestral diversity of sequenced human genomes increases, we expect this will permit smaller window sizes, which would provide better resolution for identifying smaller noncoding intolerant regions of the human genome. At the same time, there’s been demonstrated value from using longer windows too, as previous works have shown that genetic contribution of variants relies on the genomic context of broader regions14, including GC content, which has been correlated with various features of genome organization such as mutation rate and distribution of various classes of repeated elements15.
In constructing the scores, we also perform a variant pre-processing quality control step, accounting for genomic coverage, variant-calling confidence, and simple repeat regions (Fig. 1; see “Methods”). We then fit an ordinary linear regression model to predict common variants based on the total number of all variants found in each window. As with the original RVIS formulation, we define the studentized residuals of this regression model as gwRVIS, with lower gwRVIS values corresponding to greater intolerance (Fig. 2a). Notably, a set of highly tolerant windows emerges, particularly enriched for chromosome 6 and specifically for regions of the human leukocyte antigen complex (Fig. 2b), consistent with a previously reported high degree of positive selection in this region of the genome16.
a Intolerance to variation profile across all genomic windows (having 3 kb length). A regression line (shown in green) is fit between “all” and “common” (MAF > 0.1%) variants across all windows. gwRVIS can be visualized as the vertical distance of each data point from the regression line (prior to normalization by the standard deviation of the total distribution). Red dots represent the top 1% of most intolerant windows (i.e., having fewer common variants than expected) while blue dots represent the top 1% of most tolerant windows. b gwRVIS distribution across all chromosomes, as extracted from the TOPMed dataset. A highly tolerant set of windows in chromosome 6 is enriched for HLA complex regions. c gwRVIS scores distribution (with single-nucleotide resolution) across different sets of mutually exclusive CCDS windows: OMIM-Haploinsufficient, 25% most intolerant (based on RVIS), 25% most tolerant, and rest of CCDS. P values from two-sided Mann–Whitney U tests are also provided for each pair of “adjacent” coding region classes in order of increasing intolerance. d Distribution of gwRVIS scores across different coding and noncoding genomic classes, in descending order of intolerance to variation: UCNEs, VISTA enhancers, UTRs, CCDS, introns, lincRNAs, and intergenic regions. The red horizontal dashed line (gwRVIS = 0) represents the mean of the theoretical null distribution (i.e., where the observed number of common variants equals the expected number). Intergenic regions are normally distributed around the null distribution, which validates their use as an empirical null distribution. Two-sided Mann–Whitney U has been employed to compare the gwRVIS distributions across all pairs of genomic classes (***p < 1 × 10−308). For each boxplot, its central line represents the median, the bounds represent the 25th and 75th percentile, and the whiskers extend up to 1.5 the interquartile range from the respective bounds.
Stratifying the human genome based on intolerance to variation
To determine biological relevance of gwRVIS we first sought to confirm the ability of gwRVIS to differentiate between different classes of protein-coding consensus coding sequence (CCDS) windows based on their disease relevance. We looked at the gwRVIS distributions across four sets of CCDS windows: OMIM-Haploinsufficient, 25% most intolerant (as defined by RVIS), 25% most tolerant (as defined by RVIS), and the remainder of CCDS (Fig. 2c). Despite not incorporating functional protein-coding annotations in constructing gwRVIS, we observe that it still manages to correctly stratify the four CCDS sets based on their expected levels of constraint, grouping them in order of decreasing intolerance to variation as follows: OMIM-Haploinsufficient, 25% most intolerant CCDS, rest of CCDS and 25% most tolerant CCDS (Fig. 2c). Any two adjacent CCDS sets in this ranking have returned genome-wide statistically significant difference (Mann–Whitney U; p < 5 × 10−8), except for the OMIM-Haploinsufficient and 25% most intolerant CCDS sets with a non genome-wide significant divergence (Mann–Whitney U; p = 6.8 × 10−5), which is expected as they both comprise of highly intolerant subsets of the human exome.
Having shown the ability of gwRVIS to successfully predict human disease based on greater intolerance detected among the protein-coding CCDS windows overlapping human disease genes, we next compared the intolerance of different regulatory genomic classes, as captured by gwRVIS. We studied seven major genomic classes: intergenic regions, lincRNAs, introns, CCDS, UTRs, VISTA enhancers6, and ultraconserved noncoding elements or UCNEs (see “Methods”), listed here in ascending order of inferred intolerance to variation (Fig. 2d). The intergenic gwRVIS score distribution emerges as the most tolerant class with a median gwRVIS = −0.0014. This median gwRVIS closely aligns with the theoretical null distribution defined by gwRVIS = 0, reflecting an equal number of observed and expected common variants. This validates the instinctive use of intergenic gwRVIS distribution as an empirical null distribution throughout the rest of the paper.
Surprisingly, the CCDS protein-coding region of the genome was not the most intolerant functional class. We observed that UCNEs17 (highly conserved noncoding regions between human and chicken) are ranked as the most intra-species intolerant class (median gwRVIS: −0.99; Mann–Whitney U vs. intergenic: p < 1 × 10−308), and this is despite gwRVIS not using any information about species conservation in its construction (Fig. 2d). This is consistent with previous observations that UCNEs are depleted of common variants in humans18. VISTA enhancers (a class of highly conserved enhancers active during embryonic development) and CCDS follow with the next highest intolerance to variation profile, very similar to UTRs (median gwRVIS: −0.77, −0.55, and −0.51; Mann–Whitney U vs. intergenic: p < 1 × 10−308 for VISTA enhancers, CCDS and UTRs, respectively). Finally, introns and lincRNAs have a more tolerant gwRVIS score distribution that more closely resembles the distribution from intergenic regions, but due to the sheer size of the corresponding score lists the divergence remains highly significant (median scores: −0.050, −0.0015; Mann–Whitney U vs. intergenic: p < 1 × 10−308 and p = 2.6 × 10−168, respectively).
An important aspect to note is that these comparisons currently reflect the class distribution effect of the seven mutually exclusive classes studied. It is evident that within each class there are more and less tolerant windows in the human genome (Fig. 2d and Supplementary Fig. 2a). Thus, even among the intron and intergenic windows, there are some windows that achieve lower gwRVIS estimates than other windows overlapping with, for example, the CCDS sequence, and herein lies the true potential of this score to facilitate identification of more critical regions across all annotated regulatory classes.
We then test the predictive power of gwRVIS score distribution to distinguish windows overlapping the most intolerant (UCNEs) and tolerant (intergenic) genomic classes. We fit a logistic regression model with fivefold cross-validation to classify UCNEs- and intergenic-derived genomic windows, solely based on the gwRVIS score and find that gwRVIS achieves a notable area under curve (AUC) performance of 0.81 (standard deviation: 0.05) in distinguishing UCNE overlapping windows from all intergenic windows (Supplementary Fig. 2b). We also fit the same model but focused only on windows under negative selection (gwRVIS < 0), as detection of positive selection has been proven challenging19. In this case, gwRVIS achieves an even better performance with an AUC score of 0.86 (standard deviation: 0.02; Supplementary Fig. 2c).
gwRVIS predictive power for functionally important genomic elements
As we inspect the gwRVIS distributions across the different genomic classes, we observe a large variance in intolerance to variation within each class (Supplementary Fig. 2a). Thus, beyond the insight about which individual genomic classes are more/less intolerant relative to other classes, the gwRVIS distribution within a class also provides an extra dimension when trying to prioritize elements with the same functional annotation in terms of their pathogenicity potential or biological importance.
We earlier showed this to be true within the CCDS genomic class, in which known human disease genes had significantly lower gwRVIS than human CCDS genes not linked to human disease. To assess this signal outside of the protein-coding sequence, we first looked at the FANTOM520 gwRVIS scores to examine where VISTA enhancers21 preferentially fall within that distribution. FANTOM5 is a resource that contains mammalian promoters, enhancers, lincRNAs, and miRNAs, including a collection of nearly 20,000 functional lincRNAs in humans. VISTA enhancers are important as they represent experimentally validated human and mouse noncoding fragments with gene enhancer activity. We found, that among the 10% most intolerant FANTOM5-overlapping windows there was significant enrichment for VISTA enhancers compared to the 10% most tolerant end (approx. 7.5% vs. 1.5%, i.e., 5-fold enrichment; Fisher’s exact test p value = 1.85 × 10−62; Supplementary Fig. 3). This result demonstrates that the intolerant tail of the gwRVIS distribution for enhancers is more likely to be enriched for genomic elements of increased functional significance.
We have also compared gwRVIS to Orion9, another method that relies on human genomic constraint, and found that gwRVIS can better predict CCDS vs. non-CCDS regions (Mann–Whitney p value = 8.4 × 10−15 compared to 0.001 when using Orion; Supplementary Fig. 4; see “Methods”).
Finally, in order to evaluate the sensitivity on different WGS datasets used as input for the intolerance score construction, we also calculated gwRVIS based on the gnomAD dataset (n = 15,708 individuals) and found it to be significantly correlated with the scores constructed using the TOPMed dataset (Pearson’s r: 0.91; p value < 2.2 × 10−16; Supplementary Fig. 5; see “Methods”), indicating the robustness of the method on comparable WGS datasets.
Classification of noncoding pathogenic variants based on their intolerance to variation
Overall, noncoding variants represent a small fraction of all pathogenic classified variants residing among curated variant-level resources such as ClinVar22. Here, we examine the properties of gwRVIS in the context of ClinVar clinically classified pathogenic noncoding variants. We compiled two lists of noncoding variants: a pathogenic set based on ClinVar and a set of benign variants based on the “control” variants from denovo-db23 (see “Methods”). We have intentionally avoided constructing a negative set of benign variants based on common variants found in a large population cohort (e.g., gnomAD or TOPMED) due to the fact that gwRVIS construction is inherently informed by common variants distribution and, thus, this would introduce circularity in the model’s prediction performance.
Across the six non-CCDS genomic classes (Fig. 2d), we identified the following numbers of noncoding pathogenic variants (Supplementary Fig. 6): 427 in UTRs, 47 in intergenic regions, 47 in lincRNAs, 2 in VISTA enhancers, 0 in UCNEs, and 5052 in introns (across 125, 22, 6, 2, 0 and 1606 unique 3 kb windows, respectively). We also captured 51,230 variants across 3618 unique windows for CCDS regions based on the ClinVar annotation.
We then trained a logistic regression model with fivefold cross-validation, using gwRVIS or another genome-wide score (CADD24, phastCons46way25, phyloP46way26, and Orion9) as the only independent variable predictor (see “Methods”). We focused on the four noncoding genetic classes that have at least 20 distinct noncoding windows harboring pathogenic variants (lincRNAs, UTRs, intergenic, and introns). Remarkably, we observed that gwRVIS achieves the highest performance in pathogenic variant classification from lincRNA regions (AUC = 0.937; Fig. 3a), which is significantly better than the performance of phyloP46way, phastCons46way and Orion but not from CADD (DeLong test: p = 5.83 × 10−3; p = 1.7 × 10−3 p = 0.012 and p = 0.64, respectively; Supplementary Fig. 16 and Supplementary Table 1). Similarly, gwRVIS performs better than all other scores in intergenic regions (AUC = 0.763; Fig. 3b), with significantly better performance than phyloP46way, phastCons46way and Orion but not CADD (DeLong test: p = 2.65 × 10−6; p = 6.2 × 10−3 p = 0.02 and p = 0.34, respectively; Supplementary Fig. 16 and Supplementary Table 2). In UTRs, gwRVIS’s performance drops but remains significantly higher than Orion (AUCs: 0.732 vs. 0.597, DeLong test p = 4.71 × 10−4; Fig. 3c and Supplementary Table 3).
Mean ROC curves (with fivefold cross-validation) from gwRVIS benchmarking against CADD, phastCons (46-way), phyloP (46 way), and Orion, during ClinVar-pathogenic vs. denovodb-benign variant classification for three noncoding genomic classes: a lincRNAs, b intergenic regions, and c UTRs. The combined performance of gwRVIS with CADD is also shown. ncRVIS is included in the benchmarking of the UTR regions (c), as a robust score specifically designed for the UTR genomic class. One-sided DeLong’s tests have been performed to assess the statistical significance of the differences in predictive power between gwRVIS and the rest of genome-wide scores.
In order to estimate the contribution of gwRVIS information in noncoding variant detection, we also trained a multiple logistic regression model using gwRVIS and CADD as the independent variables. We observe, that gwRVIS significantly boosts CADD’s performance (Fig. 3) in lincRNAs (AUC: increased to 0.937 from 0.895; DeLong test p = 0.036), intergenic regions (AUC: increased to 0.809 from 0.741, even though non significantly; DeLong test p = 0.07) and UTRs (AUC: increased to 0.835 from 0.777; DeLong test p = 4.18 × 10−23). This indicates that gwRVIS captures orthogonal information that is not represented among the 63 distinct annotations/features employed by CADD. Moreover, although ncRVIS is the top-performing single score in UTRs (AUC = 0.823), it is lower than the combined gwRVIS and CADD score (AUC = 0.835).
We have also observed that gwRVIS is not superior in classifying pathogenic variants within protein-coding CCDS or intronic regions (Supplementary Fig. 7). This is expected as the intention of gwRVIS is to support the interpretation of the noncoding sequence where, unlike the protein-coding sequence, there is no information about variant effects. Overall, based on its optimal performance in non-coding regions using a cross-validation framework and the performance boost in the combined model with CADD, gwRVIS values are likely to be highly generalizable to other datasets when seeking to prioritize candidate variants in the non-coding genome.
A multi-module deep learning framework for non-coding variant pathogenicity inference
Equipped with a human-lineage-specific constraint score that spans the entire human genome we next sought to further improve our ability to prioritize noncoding sequence by integrating additional information beyond gwRVIS. To this end, we integrate two additional layers of information: (a) primary genomic sequence context around each variant (unstructured data) and (b) genomic annotations such as methylation, chromatin accessibility, or other structured features extracted from raw genomic sequences, such as GC content and mutability rate (see “Methods”).
By combining this information (gwRVIS, primary genomic sequence context, and additional genomic annotations) we developed “Junk Annotation” RVIS or JARVIS: a multi-module deep learning framework for pathogenicity inference of noncoding regions that still remains naïve to existing phylogenetic conservation metrics in its score construction (Fig. 4a). We trained four different models for JARVIS: (a) Gradient Boosting using structured features (i.e., without raw sequence information), (b) feed-forward deep neural net (DNN) using structured features, (c) convolutional neural net (CNN) with raw sequence input, and (d) the multi-module neural network model that combines information from both structured features and raw sequences. The multi-module model outperformed all other models used for training JARVIS on the ClinVar pathogenic variant set (Fig. 4b and Supplementary Fig. 8), achieving an AUC of 0.940 with fivefold cross-validation (compared to AUC scores of 0.930, 0.929, and 0.872, from the DNN, Gradient Boosting, and CNN models, respectively). Thus, we define as JARVIS the scores extracted by the multi-module model, which comprises of a CNN module for information inference from underlying sequence and a feed-forward DNN to assess structured feature data such as gwRVIS, sequence-derived features, and external annotations (see also “Methods”).
a Deep-learning framework for noncoding variants pathogenicity inference based on different types of annotation: genome-wide residual variation intolerance score (gwRVIS), primary genomic sequence, structured features extracted by raw genomic sequence (mutability rate, GC content, etc.) and additional annotations from Ensembl (histone marks, chromatin accessibility, CTCF-binding sites, etc.). All structured features are initially passed onto a two-layer deep neural net (DNN). Primary genomic sequences (in windows of 3 kb) are fed into a deep convolutional neural net and then flattened prior to merging with the higher representations of the structured features previously processed by the DNN. The combined higher representations of features are processed by two additional fully connected layers, followed by a “softmax” output, which gives the pathogenicity likelihood for each input variant as a probability score. b JARVIS performance with fivefold cross-validation after training with a multi-module deep neural network, using all noncoding ClinVar-based pathogenic variants and a matched set of putative benign variants from denovo-db. Variants used for training belong to any of the following genomic classes: intergenic regions, UTRs, lincRNAs, UCNEs, or VISTA enhancers. A total of 521 noncoding pathogenic variants have been used for this classification task, thus N = 1042 represents the total size of the training set (using a set of control variants of equal size). Performance for the rest of the genome-wide scores shown here has been calculated using a logistic regression model with fivefold cross-validation on the JARVIS training set. c An overview of the relative importance of the structured features integrated within JARVIS, as they are extracted by a Gradient Boosting classifier following an impurity-based feature selection algorithm.
As with gwRVIS, our training set adopted all non-coding variants annotated in ClinVar as “Pathogenic” or “Likely pathogenic” and a random subset of “control” variants from denovo-db23, considered to be benign (see “Methods”). To build a generic noncoding variant classification model, we integrated variant data from five noncoding regions during training: intergenic regions, UTRs, lincRNAs, UCNEs, and VISTA enhancers. We did not include introns given the low empirical performance of gwRVIS in these regions.
We compare JARVIS against ten other popular genome-wide scores (Fig. 4b and Supplementary Fig. 9): Orion9, CADD24, phastCons25 (46way), phylop26 (46way), DeepSEA27, ncER28, LINSIGHT29, Eigen-PC30, DANN31, and CDTS10. It is important to note that ncER, LINSIGHT, CADD, and DANN incorporate multiple phylogenetic conservation metrics (e.g., phyloP, phastCons, SiPhy, and CEGA) in their score constructions. It is expected that a large proportion of ClinVar variants are annotated as pathogenic based on occurring at evolutionarily conserved sequences22 among other evidence. This can introduce circularity when using conservation-based metrics as a predictor of literature-based pathogenic noncoding variants. In addition, these four scores have been directly or indirectly trained with a subset of the JARVIS training set. Specifically, ncER has been trained based on ClinVar (and HGMD) pathogenic variants, which comprise a subset of the pathogenic set used by JARVIS and thus its performance on the JARVIS training set may suffer from data leakage. In addition, LINSIGHT, ncER, CADD, and DANN have been trained with prior knowledge of proximity to CCDS (distance to TSS) and annotation of a region as a predicted distal regulatory module. A large proportion (>55%29) of ClinVar noncoding pathogenic variants are proximal to coding regions (classified as splicing or UTR variants) while a large part of the rest has been annotated based on a known regulatory effect22. Thus, ncER, LINSIGHT, CADD, and DANN could be impacted by ascertainment bias due to the variant distribution of the JARVIS training set with respect to closest TSS, however, we still report their performance for reference.
We trained the deep learning-based multi-module JARVIS model using all ClinVar noncoding pathogenic variants across all chromosomes with fivefold cross-validation (randomized splits) and compared its performance against the rest of the scores (see “Methods”). JARVIS performs significantly better than all other scores (AUC = 0.940; Fig. 4b; DeLong test p < 2.710−4; p values from each test available in Supplementary Fig. 19a and Supplementary Table 4), except for DeepSEA (AUC = 0.945; Fig. 4b; DeLong test: nonsignificant p = 0.064), despite some of them including conservation information. Two scores, ncER and LINSIGHT, achieved better performance on this dataset (AUC: 0.977 and 0.961; DeLong test: p = 2 × 10−5; p = 0.024, respectively; Supplementary Fig. 19a and Supplementary Table 4). However, ncER’s predictions are likely boosted by data leakage of the current JARVIS training set in its own training set and both scores may be biased with additional information, such as distance from the closest TSS. When integrating the TSS distance in the JARVIS model, this version of JARVIS indeed exceeds the performance of both LINSIGHT and ncER (AUC = 0.984; Supplementary Fig. 9). However, we do not eventually include TSS distance as a feature in the final model as we want to avoid biasing JARVIS predictions toward variants residing closer to protein-coding regions.
The fully randomized cross-validation strategy for assessing JARVIS training performance may be prone to overestimation of its generalizable performance due to the annotated ClinVar pathogenic noncoding variants being preferentially closer to protein-coding genes compared to a random set of control variants derived from denovo-db. We have thus prepared an alternative version of the training set (matched training set) by selecting control variants from denovo-db with very similar distribution of TSS distances to closest genes, as compared to the pathogenic variants employed in the JARVIS training set (Supplementary Fig. 17). JARVIS performs significantly better than all other scores (AUC = 0.800; Supplementary Fig. 18; DeLong test p < 0.0037, Supplementary Fig. 19b and Supplementary Table 5) except for the only ncER, which has significantly better performance (AUC = 870; DeLong test p = 4.93 × 10−5) but likely boosted by data leakage.
We also employed an alternative cross-validation strategy by stratifying variants based on their chromosome location. This strategy ensures that variants from the same genomic region cannot be part of both the training and test sets at any cross-validation step, thus removing any bias from data circularity. JARVIS performance dropped marginally (AUC = 0.929, Supplementary Fig. 18) but remained significantly higher than all others (DeLong test p < 5.510−11; Supplementary Fig. 19c and Supplementary Table 6) except for DeepSEA and CADD that performed nonsignificantly worse (AUC = 0.922 and 0.913, DeLong test p = 0.065 and 0.158, respectively) and ncER and LINSIGHT that performed significantly better (AUC = 0.980 and 0.969, DeLong test p = 1.16 × 10−8 and 2 × 10−4, respectively).
Based on the performance of JARVIS when using different models for training (Fig. 4b and Supplementary Fig. 8), we observe that deep learning models are superior to Gradient Boosting and also that the inclusion of raw sequences features provides the highest predictive power in the pathogenic variant classification task from non-coding regions. In order to assess the relative contribution of the structured features, we initially performed a correlation analysis (based on Pearson’s r coefficients) between all pairs of structured features and observed that gwRVIS has little correlation with the rest, highlighting that it represents an orthogonal type of information not routinely captured by the rest of the JARVIS features (Supplementary Fig. 10). The sequence-derived features (GC content, mutability rate, etc.) and some of the histone marks were found to be highly correlated with each other within the respective feature group (Supplementary Fig. 10).
It is, however, difficult to infer the real contribution from each feature employed by the full deep learning-based JARVIS model. Thus, we employ an impurity-based feature extraction algorithm with a Gradient Boosting classifier as a proxy to infer the relative contribution of each of the structured features. We observe that gwRVIS ranks second in feature importance while the sequence-derived features, specifically CpG density, are the other most dominant subset from the entire feature set (Fig. 4c). Notably, CpG density in promoter regions has been associated with loss-of-function intolerance of proximal coding regions32. Thus, JARVIS may preferentially prioritize noncoding variants that have a traceable functional effect on coding regions. Functional annotations follow with lower contribution but still carrying a considerable burden, especially for certain types of histone marks. These findings demonstrate that the genome-wide intolerance score, as it is captured by gwRVIS, adds considerable value to the predictive power of JARVIS.
We also sought to explore the feature contribution from the CNN module of the JARVIS framework and assess whether it has learnt any biologically meaningful motifs. For this analysis, we relied on previous approaches that look into the most highly activated filters that comprise the feature maps at the output of the first convolutional layer33. Specifically, we selected the most activated k-mers (where k = 11, equal to the selected filter size) in the training set of pathogenic input sequences and aligned them with known motifs (see Methods). We observed that, despite the fairly small size of the training set, JARVIS CNN module has managed to learn dozens of sequence patterns that align with known vertebrate, human or mouse motifs, many of which are enriched for long Cytosine stretches and thus GC base pairs too (Supplementary Figs. 22, 23, 24 and Supplementary Data 1).
JARVIS model validation on independent datasets
To further assess the generalized performance of JARVIS, we screened throughout the literature to identify additional independent sets of noncoding variants, either annotated as pathogenic or of high functional importance. For this task, we employed four different sets of noncoding variants described in Wells et al.28 (Fig. 5): (a) a set of 631 genome-wide association studies (GWAS) hit single-nucleotide variants (SNVs), (b) a set of 54 noncoding variants associated with Mendelian traits, and (c, d) two “generalization” test sets from noncoding RNAs or other regions28, with 35 and 17 hits, respectively. In all cases, we made sure to remove from each independent validation set any variants that were included in the JARVIS training set to avoid over-optimistic predictions induced by data leakage.
ROC curves from prediction on different sets of noncoding variants (falling into intergenic regions, UTRs, lincRNAs, UCNEs, and VISTA enhancers) not used during JARVIS training. In each case, a benign set of equal size has been randomly subset from the denovo-db control variants, avoiding any overlaps with the pathogenic variants in each of the validation sets. a GWAS hit SNVs (n = 1262). b Noncoding variants with mendelian traits (n = 118). c Generalization test set (ncRNA; n = 70). d Generalization test set (other; n = 34). In each plot, n refers to the total size of the validation set, including both pathogenic variants and a sample of control variants of equal size.
JARVIS significantly outperforms all scores in classifying GWAS hit SNVs (AUC = 0.760 vs. 0.661 from the second-ranked LINSIGHT; Fig. 5a—DeLong test p < 7.65 × 10−7; Supplementary Fig. 20 and Supplementary Table 7). It also performs significantly better than all other scores in the prediction of Mendelian noncoding variants (AUC = 0.988, DeLong test p < 6.70 × 10−3; Supplementary Fig. 20 and Supplementary Table 8) except for ncER which achieves a nonsignificantly higher performance (AUC = 0.990; Fig. 5b). In the third validation set (ncER-derived “ncRNA” generalization dataset) JARVIS ranks fourth (AUC: 0.956, Fig. 5c), significantly outperforming Eigen-PC, DANN, Orion, phastCons, and CDTS (DeLong test p < 5.18 × 10−3, Supplementary Fig. 20 and Supplementary Table 9), with LINSIGHT, ncER, and DeepSEA performing nonsignificantly better (AUC = 0.991, 0.963, 0.962; DeLong test p = 0.34, 0.36, and 0.44, respectively; Supplementary Fig. 20 and Supplementary Table 9). Similarly, in the fourth validation set (ncER-derived “other” generalization dataset) JARVIS ranks fourth (AUC: 0.914, Fig. 5d), significantly outperforming phyloP, phastCons, DANN, Eigen-PC, Orion, and CDTS (DeLong test p < 1.66 × 10−3, Supplementary Fig. 20 and Supplementary Table 10), with LINSIGHT, ncER, and DeepSEA performing again nonsignificantly better (AUC = 0.988, 0.969, 0.938; DeLong test p = 0.32, 0.59, and 0.84, respectively; Supplementary Fig. 20 and Supplementary Table 10).
Notably, JARVIS performance in the latter two test sets was even lower when TSS distance was included as part of the JARVIS training feature set (AUC: 0.874 and 0.903, respectively; Supplementary Fig. 14).
We also tested the relative performance of the four JARVIS model variations (i.e., based on either Gradient Boosting or the deep-learning-based ones, including raw sequences or without). Deep learning-based models once again outperform the Gradient Boosting classifier. Moreover, we observe that the multi-module model, integrating both raw sequence information and structured data, outperforms the rest in two of the four validation sets (Supplementary Fig. 11). The deep learning model based solely on raw sequence information is the top performer in the other two sets and especially yielding a considerably higher performance on the GWAS validation set (AUC = 0.790). These results demonstrate the added value from leveraging raw sequence information with deep learning in such classification tasks.
Overall, despite its construction being restricted to human lineage and sequence context information, JARVIS either outperforms or performs comparably with state-of-the-art scores that integrate multiple conservation-informed metrics.
Prioritization of putative pathogenic structural variants
Structural variants are large DNA rearrangements34 that have been implicated to have a profound impact in evolution and human disease35,36,37. We sought to estimate the ability of JARVIS and gwRVIS to distinguish large structural variants based on their inferred clinical impacts. We employ for this task a rich set of structural variants (SV) called from 14,891 whole genome sequences in the gnomAD dataset35 (v2.1). Structural variants overlapping with protein-coding regions have been annotated with various functional consequences (Copy Gain, Duplication-LoF, Intronic, LoF, Partial-Duplication, Promoter, UTR, and Whole-Gene inversion) or are otherwise classified as intergenic SVs. We consider the latter case (SV in intergenic regions) as our negative set of non-dosage sensitive regions and try to classify it against all other genic structural variants, that have proven to be dosage sensitive. Structural variant lengths included in this dataset vary from two nucleotides to a few million (Supplementary Fig. 12). However, the largest proportion of variants from five out of the nine structural variant classes (intronic, UTR, intergenic, promoter, and partial duplication) have length distributions centered around 100–1000 base pairs. In order to classify pathogenic structural variants from benign ones, we represent each variant as a consensus of four summary statistics across its length. Specifically, we assign to each structural variant the average of the median, mean, first and third quartiles of the respective genome-wide score (see also Methods). Thus, each structural variant is eventually represented by a single value that captures the aggregate profile of a score’s distribution within that region. We then used a 10-fold cross-validation approach based on a Logistic Regression model for benchmarking between the top-performing scores from the previous validation tests (JARVIS, LINSIGHT, ncER, CADD, Eigen-PC) along with gwRVIS and Orion.
We observe that JARVIS achieves the highest performance in six out of eight comparisons (Fig. 6a; AUC = 0.684–0.844), outperforming all other scores that follow with lower AUC ranges (Orion: 0.591–0.755; LINGISHT AUC: 0.605–0.747; ncER: 0.448–0.709, and gwRVIS: 0.542–0.667, ordered by the highest AUC in each range). In all these cases, JARVIS was significantly better than all other (DeLong test p < 0.022, Supplementary Fig. 21 and Supplementary Tables 11–16), except for LINSIGHT in the inversion-span Structural Variant test set, with a non-significantly better performance (AUC = 0.751 vs. 0.747, from JARVIS and LINSIGHT, respectively; DeLong test p < 0.087). As for the UTR-related Structural Variant class, JARVIS significantly outperforms ncER, gwRVIS, and Eigen-PC (AUC = 0.707; DeLong test p < 6.12 × 10−7) but is also significantly outperformed by LINSIGHT, Orion, and CADD (AUC = 0.748, 0.747, 0.722, respectively; DeLong test p < 0.023; Supplementary Figs. 13a, 21 and Supplementary Tables 17 and 18). Finally, performance in Intronic regions is low across all scores, with Orion leading with an AUC of 0.587 (DeLong test p = 9.86 × 10−29) and gwRVIS and JARVIS closely following with AUC scores of 0.564 and 0.562, respectively (Supplementary Figs. 13a and 21).
a ROC curves from classification of nongenic structural variants (intergenic) against different sets of putative pathogenic genic SVs, using a tenfold cross-validation approach with a logistic regression model on five scores: JARVIS, gwRVIS, LINSIGHT, ncER, and Orion. One-sided DeLong’s tests have been performed to assess the statistical significance of the differences in predictive power between JARVIS and the rest of genome-wide scores. b gwRVIS distribution across the UTR regions for the subset of called structural variants vs. the entire genomic class (median gwRVIS: −0.020 vs. −0.51, two-sided Mann–Whitney U p < 1 × 10−308). Tolerance increases toward greater gwRVIS values. c JARVIS distribution across the UTR regions for the subset of called structural variants versus the entire genomic class (mean JARVIS: 0.147 vs. 0.326, two-sided Mann–Whitney U p < 1 × 10−308). Pathogenicity likelihood increases toward greater JARVIS values.
Furthermore, we wanted to examine how structural variants called within a particular genomic class rank in terms of their intolerance to variation (as captured by gwRVIS) and pathogenicity likelihood (as captured by JARVIS) compared to the rest of the genomic class. We expect that large genomic alterations observed among the general population should not have a detrimental effect on fitness and thus should map into more tolerant regions of the human genome. We focus on these types of comparisons on SVs called in UTRs and intronic regions, where a disease-relevant effect can be traced back to a specific protein-coding gene. We observe in UTRs that the called SVs have a significantly more tolerant profile (higher values) compared to the entire distribution (Fig. 6b; median gwRVIS: −0.020 vs. −0.51, Mann–Whitney U p < 1 × 10−308) and the same pattern, albeit at a lower degree, emerges for intronic regions (Supplementary Fig. 13b; median gwRVIS: 0.051 vs. −0.039, Mann–Whitney U p < 1 × 10−308), in accordance with our original hypothesis. Similarly, the pathogenicity likelihood of SVs in UTRs and intronic regions is lower compared to the rest of each distribution (Fig. 6b and Supplementary Fig. 13b) with respective mean JARVIS scores of 0.147 vs. 0.326 for UTRs and 0.119 vs. 0.142 for the Intronic regions (Mann–Whitney U p < 1 × 10−308 in both). This result is again in accordance with the expectation of the lower pathogenic effect of large structural variations found in the general population.
Comments
Something to say?
Log in or Sign up for free