
ELECTRONICS | Electronic test, measurement & monitoring


Stage

Acquired | Acquired

About Exac

Manufacturer of precision mass flow meters for process control

Exac Headquarter Location

1370 Dell Avenue

Campbell, California, 95008,

United States

Latest Exac News

Improving the informativeness of Mendelian disease-derived pathogenicity scores for common disease

Dec 7, 2020

Abstract

Despite considerable progress on pathogenicity scores prioritizing variants for Mendelian disease, little is known about the utility of these scores for common disease. Here, we assess the informativeness of Mendelian disease-derived pathogenicity scores for common disease and improve upon existing scores. We first apply stratified linkage disequilibrium (LD) score regression to evaluate published pathogenicity scores across 41 common diseases and complex traits (average N = 320K). Several of the resulting annotations are informative for common disease, even after conditioning on a broad set of functional annotations. We then improve upon published pathogenicity scores by developing AnnotBoost, a machine learning framework to impute and denoise pathogenicity scores using a broad set of functional annotations. AnnotBoost substantially increases the informativeness for common disease of both previously uninformative and previously informative pathogenicity scores, implying that Mendelian and common disease variants share similar properties. The boosted scores also produce improvements in heritability model fit and in classifying disease-associated, fine-mapped SNPs. Our boosted scores may improve fine-mapping and candidate gene discovery for common disease.

Introduction

Despite considerable progress on pathogenicity scores prioritizing both coding and non-coding variants for Mendelian disease 1 , 2 , 3 , 4 , 5 , 6 , 7 , 8 , 9 , 10 (reviewed in ref. 11 ), little is known about the utility of these pathogenicity scores for common disease.
The shared genetic architecture between Mendelian disease and common disease has been implicated in studies reporting the impact of genes underlying monogenic forms of common diseases on the corresponding common diseases 12 , significant comorbidities among Mendelian and complex diseases 13 , and gene-level overlap between Mendelian diseases and cardiovascular diseases 14 , 15 , 16 , neurodevelopmental traits 17 , 18 , and other complex traits 19 . However, variant-level assessment of shared genetic architecture using Mendelian disease-derived pathogenicity scores has not been explored. Thus, our current understanding of the genetic relationship between Mendelian disease and common disease remains limited. Here, we assess the informativeness of Mendelian disease-derived pathogenicity scores for common disease and improve upon existing scores. We focus our attention on polygenic common and low-frequency variant architectures, which explain the bulk of common disease heritability 20 , 21 , 22 , 23 , 24 . We assess the informativeness of annotations defined by top variants from published Mendelian disease-derived pathogenicity scores by applying stratified linkage disequilibrium (LD) score regression 25 (S-LDSC) with the baseline-LD model 26 , 27 to 41 independent common diseases and complex traits (average N = 320 K). We assess informativeness conditional on the baseline-LD model, which includes a broad set of coding, conserved, regulatory, and LD-related annotations. Then, we improve upon the published pathogenicity scores by developing AnnotBoost, a gradient boosting-based machine-learning framework to impute and denoise pathogenicity scores using functional annotations from the baseline-LD model. 
We assess the informativeness of annotations defined by top variants from the boosted scores by applying S-LDSC to quantify conditional informativeness after considering annotations from the baseline-LD model as well as annotations derived from the corresponding published scores. We also assess the informativeness of the published and boosted pathogenicity scores in producing improvements in heritability model fit and in predicting disease-associated, fine-mapped SNPs. We find that several annotations derived from published pathogenicity scores are informative for common disease, even after conditioning on annotations from the baseline-LD model. Furthermore, AnnotBoost substantially increases the informativeness for common disease of both previously uninformative and previously informative pathogenicity scores, implying that Mendelian and common disease variants share similar properties. We conclude that our boosted scores have high potential to improve fine-mapping and candidate gene discovery for common disease.

Results

Overview of methods

We define a binary annotation as an assignment of a binary value to each low-frequency (0.5% ≤ MAF < 5%) and common (MAF ≥ 5%) SNP in a 1000 Genomes Project European reference panel 28 , as in our previous work 25 , 27 . We define a pathogenicity score as an assignment of a numeric value quantifying predicted pathogenicity, deleteriousness, and/or protein function to some or all of these SNPs; we refer to these scores as Mendelian disease-derived pathogenicity scores, as they have predominantly been developed and assessed in the context of Mendelian disease (e.g., using pathogenic variants from ClinVar 29 and HGMD 30 ). We analyze 11 Mendelian disease-derived missense scores, six genome-wide Mendelian disease-derived scores, and 18 additional scores.
Our primary focus is on binary annotations defined either using top variants from published (missense or genome-wide) Mendelian disease-derived pathogenicity scores, or using top variants from boosted scores that we constructed from those pathogenicity scores using AnnotBoost, a gradient boosting-based framework that we developed to impute and denoise pathogenicity scores using 75 coding, conserved, regulatory and LD-related annotations from the baseline-LD model 26 , 27 (Supplementary Fig. 1 ; see “Methods”). AnnotBoost uses decision trees to distinguish pathogenic variants (defined using the input pathogenicity score) from benign variants; the AnnotBoost model is trained using the XGBoost gradient boosting software 31 . AnnotBoost uses odd (respectively even) chromosomes as training data to make predictions for even (respectively odd) chromosomes; the output of AnnotBoost is the predicted probability of being pathogenic. We note that Mendelian disease-derived pathogenicity scores may score only a subset of SNPs, whereas every baseline-LD model annotation scores all SNPs. Further details are provided in the Methods section; we have publicly released open-source software implementing AnnotBoost (see “Code availability”), as well as all pathogenicity scores and binary annotations analyzed in this work (see “Data availability”). We assessed the informativeness of the resulting binary annotations for common disease heritability by applying S-LDSC 25 to 41 independent common diseases and complex traits 32 (average N = 320 K; Supplementary Table  1 ; see “Data availability”), conditioned on coding, conserved, regulatory and LD-related annotations from the baseline-LD model 26 , 27 and meta-analyzing results across traits.
We assessed informativeness for common disease using standardized effect size (τ*), defined as the proportionate change in per-SNP heritability associated with a one standard deviation increase in the value of the annotation, conditional on other annotations 26 (see “Methods”). We also computed the heritability enrichment, defined as the proportion of heritability divided by the proportion of SNPs. Unlike enrichment, τ* quantifies effects that are unique to the focal annotation; annotations with significantly positive or negative τ* are informative after considering all other annotations in the model, whereas annotations with τ* = 0 contain no unique information, even if they are enriched for heritability (see “Methods”). While S-LDSC models linear combinations of functional annotations, AnnotBoost constructs (linear and) non-linear combinations of baseline-LD model annotations to provide unique information.

Informativeness of Mendelian disease-derived missense scores for common disease

We assessed the informativeness for common disease of binary annotations derived from 11 Mendelian disease-derived pathogenicity scores for missense variants 1 , 5 , 6 , 7 , 8 , 33 , 34 , 35 , 36 , 37 (see Table  1 ). These scores reflect the predicted impact of missense mutations on Mendelian disease; we note that our analyses of common disease are focused on common and low-frequency variants, but these scores were primarily trained using very rare variants from ClinVar 29 and Human Gene Mutation Database (HGMD) 30 .
For each of the 11 missense scores, we constructed binary annotations based on top missense variants using five different thresholds (from top 50% to top 10% of missense variants) and applied S-LDSC 25 , 26 to 41 independent common diseases and complex traits (Supplementary Table  1 ), conditioning on coding, conserved, regulatory and LD-related annotations from the baseline-LD model 26 , 27 and meta-analyzing results across traits; proportions of top SNPs were optimized to maximize informativeness (see “Methods”). We incorporated the five different thresholds into the number of hypotheses tested when assessing statistical significance (Bonferroni P < 0.05/500 = 0.0001, based on a total of ≈ 500 hypotheses tested in this study; see “Methods”). We identified (Bonferroni-significant) conditionally informative binary annotations derived from two published missense scores: the top 30% of SNPs from MPC 36 (enrichment = 27x (s.e. 2.5), τ* = 0.60 (s.e. 0.07)) and the top 50% of SNPs from PrimateAI 8 (enrichment = 17x (s.e. 2.0), τ* = 0.42 (s.e. 0.09)) (Fig. 1 , Table  2 and Supplementary Data  1 ). The MPC (Missense badness, PolyPhen-2, and Constraint) score 36 is computed by identifying regions within genes that are depleted for missense variants in ExAC data 38 and incorporating variant-level metrics to predict the impact of missense variants; the PrimateAI score 8 is computed by eliminating common missense variants identified in other primate species (which are presumed to be benign in humans), incorporating a deep-learning model trained on the amino acid sequence flanking the variant of interest and the orthologous sequence alignments in other species.
The remaining published Mendelian disease-derived missense scores all had derived binary annotations that were significantly enriched for disease heritability (after Bonferroni correction) but not conditionally informative (except for the published M-CAP 7 score, which spanned too few SNPs to be included in the S-LDSC analysis).

Table 1: 11 Mendelian disease-derived missense and six genome-wide Mendelian disease-derived pathogenicity scores.

Second, we assessed each model’s accuracy in classifying three different sets of fine-mapped SNPs (versus 10 LD-, MAF-, and genomic-element-matched control SNPs per fine-mapped SNP in the reference panel 28 ): 7,333 fine-mapped SNPs for 21 autoimmune diseases from Farh et al. 59 , 3,768 fine-mapped SNPs for inflammatory bowel disease from Huang et al. 60 , and 1,851 fine-mapped SNPs for 49 UK Biobank traits from Weissbrod et al. 61 . We note that with the exception of the Weissbrod et al. fine-mapped SNPs (stringently defined by causal posterior probability ≥ 0.95; FDR < 0.05), 95% credible fine-mapped SNPs likely include a large fraction of non-causal variants. We computed the AUPRC attained by the combined joint model and the combined marginal model, relative to a model with no functional annotations (ΔAUPRC), aggregated by training a gradient boosting model (multi-score analysis); we used odd (respectively even) chromosomes as training data to make predictions for even (respectively odd) chromosomes (see “Methods”). We note that this gradient boosting model uses disease data (fine-mapped SNPs), whereas AnnotBoost does not use disease data to construct boosted pathogenicity scores; specifically, our boosted scores do not use fine-mapped SNPs. The combined joint model attained a +2.5% to +6.9% larger ΔAUPRC than the baseline-LD model (each P < 3e−28); the combined marginal model attained a +4.9% to +21.3% larger ΔAUPRC than the baseline-LD model (each P < 7e−100); we obtained similar results using AUROC (Supplementary Fig. 6 , Supplementary Data  21 ).
This improvement likely comes from non-linear interactions involving the boosted annotations, published annotations, and the baseline-LD model. We performed eight secondary analyses. First, we repeated the \(\log l_{SS}\) analysis on the model with 19 new annotations with conditional τ* > 0.5; we determined that this model attained a +10.6% larger \(\Delta \log l_{SS}\) and +9.7% larger AIC than the baseline-LD model (P < 2e−50) (Supplementary Data  20 ). Second, we applied SHAP 41 to investigate which features contributed the most to classification of fine-mapped SNPs; we determined that boosted scores (e.g., H3K9ac↑, CADD) often drove the predictions, validating the potential utility of boosted scores in functionally informed fine-mapping (Supplementary Figs. 7 and 8 ). Third, we repeated the classification of fine-mapped SNPs using a single LD-, MAF-, and genomic-element-matched control variant (instead of 10 control variants) for each fine-mapped SNP, and obtained similar results (Supplementary Data  21 ). Fourth, we repeated the classification analysis of the Weissbrod et al. fine-mapped SNPs using 1,379 SNPs that were fine-mapped without using functional information 61 (to ensure that results were not circular), and obtained similar results (Supplementary Fig. 6 , Supplementary Data  21 ). Fifth, we computed the AUPRCs for classifying fine-mapped SNPs individually attained by each of 82 published and 82 boosted scores (single-score analysis), comparing results for boosted scores vs. the corresponding published scores. The boosted scores significantly outperformed the corresponding published scores for most scores in each fine-mapped SNP set (66/82 to 80/82 scores; Supplementary Fig. 9 , Supplementary Data  22 and 23 ).
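The AUPRC comparisons described above can be illustrated with a minimal sketch. It uses the step-wise average-precision estimator of the area under the precision-recall curve, which is an assumption here since the text does not say which estimator was used, and it treats ΔAUPRC as a simple difference from a reference model. The function names `average_precision` and `delta_auprc` are hypothetical and not part of the released software.

```python
def average_precision(labels, scores):
    """Step-wise estimate of the area under the precision-recall curve.

    labels: list of 0/1 (1 = fine-mapped SNP, 0 = matched control SNP).
    scores: list of model scores, higher = more likely fine-mapped.
    Ties in scores are broken by input order (an illustrative simplification).
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("at least one positive label is required")
    # Rank SNPs from highest to lowest score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank  # precision at this positive's rank
    return total / n_pos


def delta_auprc(labels, model_scores, reference_scores):
    """AUPRC of a model minus AUPRC of a reference model (assumed definition)."""
    return average_precision(labels, model_scores) - average_precision(
        labels, reference_scores
    )


# Toy example: one fine-mapped SNP and three matched controls.
labels = [1, 0, 0, 0]
good_model = [0.9, 0.2, 0.1, 0.3]   # ranks the fine-mapped SNP first
weak_model = [0.2, 0.9, 0.1, 0.3]   # ranks the fine-mapped SNP third
improvement = delta_auprc(labels, good_model, weak_model)
```

With one fine-mapped SNP per 10 matched controls, a random ranking would be expected to give an AUPRC near the positive fraction (about 1/11), which is why results are more naturally read as differences between models than as absolute values.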
We also found that AUPRC and AUROC results for published and boosted scores were moderately correlated with S-LDSC results (up to r = 0.67) for binary annotations derived from these scores, validating the S-LDSC results (Supplementary Data  24 ). Sixth, we repeated the single-score and multi-score analysis using 14,807 NHGRI GWAS SNPs 62 , 63 ; we obtained similar results (Supplementary Figs. 6 and 9 , Supplementary Data  21 and 22 ). Seventh, we computed genome-wide correlations between all annotations analyzed, including baseline-LD model annotations (Supplementary Data  18 ). Several of the jointly significant annotations were strongly correlated (up to 0.73) with conservation-related annotations from the baseline-LD model, particularly binary GERP scores, consistent with our SHAP results (Supplementary Figs. 2 , 4, and 5 ). Eighth, we compared the informativeness of the baseline-LD model and the combined joint model. We determined that the addition of the 11 jointly significant annotations greatly reduced the informativeness of several existing baseline-LD annotations, including conservation-related annotations (e.g., conserved primate, binary GERP scores) and other annotations (e.g., coding, CpG content; see Supplementary Fig. 10 and Supplementary Data  25 ), recapitulating the informativeness of the 11 jointly significant annotations. We conclude that the combined joint model and the combined marginal model both significantly outperformed the baseline-LD model, validating the informativeness of our new annotations for common disease. The improvement was much larger for the combined marginal model; this finding was surprising, in light of our previous work advocating for conservatively restricting to jointly significant annotations when expanding heritability models 26 , 32 , 39 , 40 , 53 . However, we caution that due to the much larger number of new annotations in the combined marginal model, the combined joint model may still be preferred in some settings.
Discussion

We analyzed the informativeness of a broad set of Mendelian disease-derived pathogenicity scores across 41 independent common diseases and complex traits to show that several annotations derived from published Mendelian disease-derived scores were conditionally informative for common disease after conditioning on the baseline-LD model. We further developed AnnotBoost, a gradient boosting-based machine-learning framework to impute and denoise existing pathogenicity scores. We determined that annotations derived from boosted pathogenicity scores were even more informative for common disease, resulting in 64 marginally significant annotations and 11 jointly significant annotations and implying that Mendelian disease variants and common disease variants share similar properties. These variant-level results are substantially different from previous studies of gene-level overlap between Mendelian diseases and complex traits 12 , 13 , 14 , 15 , 16 , 17 , 18 , 19 . Notably, our new annotations produced significant improvements in heritability model fit and in classifying disease-associated, fine-mapped SNPs. We also detected significant excess overlap between genes linked to our new annotations and biologically important gene sets. We note three key differences between AnnotBoost and previous approaches that utilized gradient boosting to identify pathogenic missense 7 and non-coding variants 9 , 10 . First, AnnotBoost uses a pathogenicity score as the only input and does not use disease data (e.g., ClinVar 29 or HGMD 30 ). Second, AnnotBoost produces genome-wide scores, even when some SNPs are unscored by the input pathogenicity score. Third, AnnotBoost leverages 75 diverse features from the baseline-LD model 26 , 27 , significantly more than the previous approaches 7 , 9 , 10 . Indeed, we determined that AnnotBoost produces strong signals even when conditioned on those approaches.
Our findings have several ramifications for improving our understanding of common disease. First, elucidating specific mechanistic links between Mendelian disease and common disease may yield important biological insights. Second, it is of interest to assess the informativeness for common disease of Mendelian disease pathogenicity scores that may be developed in the future, particularly after imputing and denoising these scores using AnnotBoost; this would further elucidate the shared properties between Mendelian disease variants and common disease variants. Third, annotations derived from published and boosted Mendelian pathogenicity scores can be used to improve functionally informed fine-mapping 61 , 64 , 65 , 66 , 67 , motivating their inclusion in future large-scale fine-mapping studies. (On the other hand, we anticipate that our new annotations will be less useful for improving functionally informed polygenic risk prediction 68 , 69 and association mapping 70 , because there is pervasive LD between SNPs in an annotation and SNPs outside of an annotation, such that these annotations do not distinguish which LD blocks contain causal signal.) Fourth, the larger improvement for our combined marginal model versus our combined joint model (Fig. 5 a) advocates for a more inclusive approach to expanding heritability models, as compared to our previous work advocating for conservatively restricting to jointly significant annotations 26 , 32 , 39 , 40 , 53 . However, the combined marginal model comes at a cost of reduced interpretability (it contains a much larger number of new annotations, and it is unclear which of these annotations are providing the improvement); thus, the combined joint model may still be preferred in some settings.
Fifth, gene scores derived from published and boosted Mendelian pathogenicity scores can be used to help identify biologically important genes; we constructed gene scores by linking SNPs to their nearest gene, but better strategies for linking regulatory variants to genes 71 , 72 , 73 could potentially improve upon our results. We note several limitations of our work. First, we focused our analyses on common diseases (which are driven by common and low-frequency variants) and did not analyze Mendelian diseases (which are driven by very rare variants); the application of AnnotBoost to impute and denoise very rare pathogenic variants for Mendelian disease is a direction for future work. Second, we primarily report results that are meta-analyzed across 41 traits (analogous to previous studies 25 , 26 , 32 , 39 , 40 , 53 ), but results and their interpretation may vary substantially across traits. Nonetheless, our combined marginal model produced a significant improvement in heritability model fit for 30/30 UK Biobank traits analyzed (Fig. 5 b). Third, S-LDSC is not well-suited to the analysis of annotations spanning a very small proportion of the genome, preventing the analysis of a subset of published pathogenicity scores; nonetheless, our main results attained high statistical significance. Fourth, we restricted all of our analyses to European populations, which have the largest available GWAS sample size. However, we expect our results to be generalizable to other populations, as functional enrichments have been shown to be highly consistent across ancestries 65 , 74 , 75 ; we note that assessing functional enrichments in admixed populations 76 would require the application of an unpublished extension of S-LDSC 77 .
Fifth, the gene-based SNP scores that we analyzed did not perform well, perhaps because they were defined using 100kb windows, a crude strategy employed in previous work 32 , 53 , 78 ; better strategies for linking regulatory variants to genes 71 , 72 , 73 (as noted above for gene scores) could potentially improve upon those results. Despite these limitations, the imputed and denoised pathogenicity scores produced by our AnnotBoost framework have high potential to improve gene discovery and fine-mapping for common disease.

Methods

Genomic annotations and the baseline-LD model

We define a genomic annotation as an assignment of a numeric value to each SNP above a specified minor allele frequency (e.g., MAF ≥ 0.5%) in a predefined reference panel (e.g., 1000 Genomes 28 ). Continuous-valued annotations can have any real value. Probabilistic annotations can have any real value between 0 and 1. A binary annotation can be viewed as a subset of SNPs (the set of SNPs with annotation value 1); we note that all annotations analyzed in this work are binary annotations. Annotations that correspond to known or predicted functions are referred to as functional annotations. The baseline-LD model 26 (v2.1) contains 86 functional annotations (see “Data availability”). We use these annotations as features of AnnotBoost (see below). These annotations include genomic elements (e.g., coding, enhancer, promoter), conservation (e.g., GERP, PhastCons), regulatory elements (e.g., histone marks, DNaseI-hypersensitive sites (DHS), transcription factor (TF) binding sites), and LD-related annotations (e.g., predicted allele age, recombination rate, SNPs with low levels of LD).

Enrichment and τ* metrics

We used stratified LD score regression (S-LDSC 25 , 26 ) to assess the contribution of an annotation to disease heritability by estimating the enrichment and the standardized effect size (τ*) of an annotation.
Let \(a_{cj}\) represent the (binary or probabilistic) annotation value of SNP j for annotation c. S-LDSC assumes that the variance of per-normalized-genotype effect sizes is a linear additive function of the annotation contributions:

$$\text{Var}(\beta_j) = \sum_c a_{cj}\,\tau_c$$ (1)

where \(\tau_c\) is the per-SNP contribution of annotation c. S-LDSC estimates \(\tau_c\) using the following equation:

$$\text{E}[\chi_j^2] = N \sum_c \ell(j,c)\,\tau_c + 1$$ (2)

where N is the sample size of the GWAS and \(\ell(j,c)\) is the LD score of SNP j with respect to annotation c. The LD score is computed as \(\ell(j,c) = \sum_k a_{ck}\, r_{jk}^2\), where \(r_{jk}\) is the correlation between SNPs j and k. We used two metrics to assess the informativeness of an annotation. First, the standardized effect size (τ*), the proportionate change in per-SNP heritability associated with a one standard deviation increase in the value of the annotation (conditional on all the other annotations in the model), is defined as follows:

$$\tau_c^{*} = \frac{\tau_c\, sd(C)}{h_g^2 / M}$$ (3)

where sd(C) is the standard deviation of annotation c, \(h_g^2\) is the estimated SNP-heritability, and M is the number of variants used to compute \(h_g^2\) (in our experiment, M is equal to 5,961,159, the number of common SNPs in the reference panel). The significance of the effect size for each annotation, as described in previous studies 26 , 32 , 53 , is computed by assuming that \(\frac{\tau^{*}}{\text{se}(\tau^{*})}\) follows a normal distribution with zero mean and unit variance, \(\frac{\tau^{*}}{\text{se}(\tau^{*})} \sim N(0,1)\).
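As a worked illustration of the LD score, Eq. (3), and the significance test, the following minimal sketch computes each quantity from plain numbers. The function names are hypothetical, the toy values of τ, sd(C) and \(h_g^2\) are invented for illustration, and the choice of a two-sided test is an assumption (the text specifies only the N(0,1) reference distribution).

```python
import math


def ld_score(annot_values, r2_with_j):
    """l(j, c) = sum_k a_ck * r_jk^2 for a single SNP j.

    annot_values: annotation values a_ck for each SNP k.
    r2_with_j: squared correlations r_jk^2 between SNP j and each SNP k.
    """
    return sum(a * r2 for a, r2 in zip(annot_values, r2_with_j))


def standardized_tau(tau, annot_sd, h2g, m):
    """tau* = tau * sd(C) / (h2g / M), as in Eq. (3)."""
    return tau * annot_sd / (h2g / m)


def z_test_p_value(tau_star, se):
    """Two-sided p-value for z = tau*/se under N(0, 1) (two-sidedness assumed)."""
    z = tau_star / se
    return math.erfc(abs(z) / math.sqrt(2))


# LD score of SNP j with respect to a binary annotation over three SNPs.
example_ld_score = ld_score([1, 0, 1], [1.0, 0.5, 0.25])

# Toy inputs (illustrative only) with M set to the value quoted in the text.
example_tau_star = standardized_tau(tau=1e-8, annot_sd=0.1, h2g=0.5, m=5_961_159)

# Reported MPC estimate: tau* = 0.60 (s.e. 0.07), compared against the
# Bonferroni threshold of 0.05/500 used in the study.
mpc_p = z_test_p_value(0.60, 0.07)
bonferroni_threshold = 0.05 / 500
mpc_is_significant = mpc_p < bonferroni_threshold
```

Because τ* is scaled by both the annotation's standard deviation and the average per-SNP heritability, it is comparable across annotations of different sizes and across traits with different heritabilities, which is what makes meta-analysis across traits meaningful.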
Second, the enrichment of a binary or probabilistic annotation is the fraction of heritability explained by SNPs in the annotation divided by the proportion of SNPs in the annotation, as shown below:

$$\text{Enrichment} = \frac{\% h_g^2(C)}{\% \text{SNP}(C)} = \frac{h_g^2(C) / h_g^2}{\sum_j a_{cj} / M}$$ (4)

where \(h_g^2(C)\) is the heritability captured by the cth annotation. When the annotation is enriched for trait heritability, the enrichment is > 1; the overlap is greater than one would expect given the trait heritability and the size of the annotation. The significance of enrichment is computed using the block jackknife, as described in previous studies 25 , 32 , 53 , 78 . The key difference between enrichment and τ* is that τ* quantifies effects that are unique to the focal annotation after conditioning on all the other annotations in the model, while enrichment quantifies effects that are unique and/or non-unique to the focal annotation. In all our analyses, we used the European samples in 1000 Genomes 28 (see “Data availability”) as reference SNPs. Regression SNPs were obtained from HapMap 3 79 (see “Data availability”). SNPs with marginal association statistics > 80 and SNPs in the major histocompatibility complex (MHC) region were excluded. Unless stated otherwise, we included the baseline-LD model 26 in all primary analyses using S-LDSC, both to minimize the risk of bias in enrichment estimates due to model mis-specification 25 , 26 and to estimate effect sizes (τ*) conditional on known functional annotations.

Published Mendelian disease-derived pathogenicity scores

We considered a total of 35 published scores: 11 Mendelian disease-derived missense pathogenicity scores, 6 genome-wide Mendelian disease-derived pathogenicity scores, and 18 additional scores (see Table  1 and Supplementary Data  13 ).
Here, we provide a short description of the Mendelian missense and genome-wide Mendelian disease-derived pathogenicity scores. Details for the 18 additional scores and the baseline-LD annotations are provided in Supplementary Data  13 . Our curated pathogenicity scores are available online (see “Data availability”). For all scores, we constructed annotations using the GRCh37 (hg19) assembly, limited to all 9,997,231 low-frequency and common SNPs (with MAF ≥ 0.5%) found in 1000 Genomes 28 European Phase 3 reference genome individuals. Mendelian missense scores were readily available from the dbNSFP database 80 , 81 using a rank score (a converted score based on the rank among scored SNPs); genome-wide Mendelian disease-derived scores were individually downloaded and used with no modification to the original scores (see “Data availability”). For each pathogenicity score, we constructed a binary annotation based on an optimized threshold (see below). Short descriptions of each pathogenicity score are provided below.

Mendelian disease-derived missense pathogenicity scores:

PolyPhen-2 1 , 33 (HDIV and HVAR): Higher scores indicate a higher probability of the missense mutation being damaging to protein function and structure. The default predictor is based on a naive Bayes classifier using HumDiv (HDIV), and the other is trained using HumVar (HVAR), using eight sequence-based and three structure-based features.

MetaLR/MetaSVM 34 : An ensemble prediction score based on logistic regression (LR) or support vector machine (SVM) to classify pathogenic mutations from background SNPs in whole-exome sequencing, combining nine prediction scores and one additional feature (maximum minor allele frequency).

PROVEAN 35 , 82 : An alignment-based score to predict damaging single amino acid substitutions.

SIFT 4G 5 : Predicted deleterious effects of an amino acid substitution on protein function, based on sequence homology and physical properties of amino acids.

REVEL 6 : An ensemble prediction score based on a random forest classifier trained on 6182 missense disease mutations from HGMD 30 , using 18 pathogenicity scores as features.

M-CAP 7 : An ensemble prediction score based on a gradient boosting classifier trained on pathogenic variants from HGMD 30 and benign variants from the ExAC data set 38 , using nine existing pathogenicity scores, seven base-pair, amino acid, genomic region, and gene-based features, and four features from multiple sequence alignments across 99 species.

PrimateAI 8 : A deep-learning-based score trained on the amino acid sequence flanking the variant of interest and the orthologous sequence alignments in other species, and eliminating common missense variants identified in six non-human primate species.

MPC 36 (missense badness, PolyPhen-2, and constraint): Logistic regression-based score that identifies regions within genes that are depleted for missense variants in ExAC data 38 and incorporates variant-level metrics to predict the impact of missense variants. A higher MPC score indicates increased deleteriousness of amino acid substitutions occurring in missense-constrained regions.

MVP 37 : A deep-learning-based score trained on 32,074 pathogenic variants from ClinVar 29 , HGMD 30 , and UniProt 83 , using 38 local contexts, constraint, conservation, protein structure, gene-based, and existing pathogenicity scores as features.

Genome-wide Mendelian disease-derived pathogenicity scores:

CADD 2 , 46 : An ensemble prediction score based on a support vector machine classifier trained to differentiate 14.7 million high-frequency human-derived alleles from 14.7 million simulated variants, using 63 conservation, regulatory, protein-level, and existing pathogenicity scores as features. We used the PHRED-scaled CADD score for all possible SNVs of GRCh37.

Eigen/Eigen-PC 3 : Unsupervised machine-learning score based on 29 functional annotations and leveraging blockwise conditional independence between annotations to differentiate functional vs. non-functional variants. Eigen-PC uses the lead eigenvector of the annotation covariance matrix to weight the annotations. For both Eigen and Eigen-PC, we used PHRED-scaled scores and combined coding and non-coding regions to form a single genome-wide score. A higher score indicates a more important (predicted) functional role.

ReMM 4 (regulatory Mendelian mutation): An ensemble prediction score based on a random forest classifier to distinguish 406 hand-curated Mendelian mutations from neutral variants using conservation scores and functional annotations. A higher ReMM score indicates greater potential to cause a Mendelian disease if mutated.

NCBoost 10 : An ensemble prediction score based on a gradient boosting classifier trained on 283 pathogenic non-coding SNPs associated with Mendelian disease genes and 2830 common SNPs, using 53 conservation, natural selection, gene-based, sequence context, and epigenetic features.

ncER 9 (non-coding essential regulation): An ensemble prediction score based on a gradient boosting classifier trained on 782 non-coding pathogenic variants from ClinVar 29 and HGMD 30 , using 38 gene essentiality, 3D chromatin structure, regulatory, and existing pathogenicity scores as features.

AnnotBoost framework

AnnotBoost is based on gradient boosting, a machine-learning method for classification; the AnnotBoost model is trained using the XGBoost gradient boosting software 31 (see “Code availability”). AnnotBoost requires only one input, a pathogenicity score to boost, and generates a genome-wide (probabilistic) pathogenicity score (as described in Supplementary Fig. 1 ).
During training, AnnotBoost uses decision trees, where each node in a tree splits SNPs into two classes (pathogenic and benign) using 75 coding, conserved, regulatory, and LD-related features from the baseline-LD model 26 (excluding 10 MAF bin features; we obtained similar results with or without MAF bin features; see Supplementary Fig. 11 ). We note that the baseline-LD annotations are defined for all low-frequency and common SNPs and thus have no unscored SNPs. The method generates training data from the input pathogenicity scores without using external variant data: the top 10% of SNPs from the input pathogenicity score are labeled as a positive training set, and the bottom 40% of SNPs are labeled as a control training set; we obtained similar results with other training data ratios (see Supplementary Fig. 12 ). The prediction is based on T additive estimators (we use T = 200–300; see below), minimizing the following loss objective function \(L^{t}\) at the t-th iteration:

$$L^{t}=\sum_{i=1}^{n} l\left(y_{i},\hat{y}_{i}^{\,t-1}+f_{t}(x_{i})\right)+\gamma(f_{t})$$ (5)

where l is a differentiable convex loss function (which measures the difference between the prediction \(\hat{y}_{i}\) and the target \(y_{i}\) at the i-th instance), \(f_{t}\) is an independent tree structure, and the last term \(\gamma(f_{t})\) penalizes the complexity of the model, helping to avoid over-fitting. The prediction is made as \(\hat{y}_{i}=\sum_{t=1}^{T}f_{t}(x_{i})\), ensembling the outputs of multiple weak-learner trees. Odd (respectively even) chromosome SNPs are used for training to score even (respectively odd) chromosome SNPs. The output of the classifier is the probability of being similar to the positive training SNPs and dissimilar to the control training SNPs.
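The label-generation and cross-chromosome scheme described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names, the dictionary-based inputs, and the tie-handling (whatever order the sort produces) are hypothetical and are not taken from the AnnotBoost source code.

```python
def make_training_labels(scores, top_pct=10, bottom_pct=40):
    """Label the top `top_pct`% of scored SNPs as 1 and the bottom `bottom_pct`% as 0.

    `scores` maps a (hypothetical) SNP id to its input pathogenicity score.
    SNPs between the two tails stay unlabeled and are absent from the result.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)  # highest score first
    n = len(ranked)
    n_pos = n * top_pct // 100
    n_ctrl = n * bottom_pct // 100
    labels = {snp: 1 for snp in ranked[:n_pos]}            # positive training set
    labels.update({snp: 0 for snp in ranked[n - n_ctrl:]}) # control training set
    return labels


def cross_chromosome_folds(snp_chrom):
    """Return (train, score) SNP lists: train on odd, score even, and vice versa."""
    odd = sorted(s for s, c in snp_chrom.items() if c % 2 == 1)
    even = sorted(s for s, c in snp_chrom.items() if c % 2 == 0)
    return [(odd, even), (even, odd)]


# Toy example with 20 SNPs whose scores increase with the id.
scores = {f"rs{i}": i / 20 for i in range(20)}
labels = make_training_labels(scores)
folds = cross_chromosome_folds({"rs1": 1, "rs2": 2, "rs3": 3})
```

With 20 toy SNPs, 2 receive the positive label and 8 the control label; the remaining 10 are not used for training.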
We used the following model parameters: the number of estimators (200, 250, 300), depth of the tree (25, 30, 35), learning rate (0.05), gamma (minimum loss reduction required before additional partitioning on a leaf node; 10), minimum child weight (6, 8, 10), and subsample (0.6, 0.8, 1); we optimized parameters with hyperparameter tuning (a randomized search) with fivefold cross-validation. Two important parameters to avoid over-fitting are gamma and learning rate; we chose these values consistent with the previous studies 9 , 10 . The model with the highest AUROC on the held-out data was selected and used to make a prediction.

To identify which feature(s) drive the prediction output with less bias, AnnotBoost uses SHapley Additive exPlanations (SHAP 41 ), a widely used tool to interpret complex non-linear models, instead of the built-in feature importance tool, because SHAP satisfies the symmetry, dummy player, and additivity axioms. SHAP uses the training matrix (features × SNP labels) and the trained model to generate a signed impact of each baseline-LD feature on the AnnotBoost prediction.

To evaluate the performance of classifiers, we plotted receiver operating characteristic (ROC) and precision-recall (PR) curves. As we train AnnotBoost by splitting SNPs into odd and even chromosomes, we report the average out-of-sample area under the curve (AUC) of the odd and even chromosome classifiers. We used the threshold of 0.5 to define a class; that is, class 1 includes SNPs with output probability > 0.5. We caution that high classification accuracy does not necessarily translate into conditional informativeness for common disease 39 .

Constructing binary annotations using top variants from published and boosted scores

For published Mendelian disease-derived missense pathogenicity scores, we considered five different thresholds to construct binary annotations: top 50, 40, 30, 20 or 10% of scored variants.
For published scores that produce Bonferroni-significant binary annotations, we report results for the binary annotation with the largest ∣τ*∣ among those that are Bonferroni-significant. For published scores that do not produce Bonferroni-significant binary annotations, we report results for the threshold with the most significant τ* (even though not Bonferroni-significant). For all other published pathogenicity scores, we considered the top 10, 5, 1, 0.5 or 0.1% of scored variants to construct binary annotations; we used more inclusive thresholds for published Mendelian disease-derived missense pathogenicity scores due to the small proportion of variants scored (~0.3%; see Table  1 ). For published scores that produce Bonferroni-significant binary annotations, we report results for the binary annotation with the largest ∣τ*∣ among those that are Bonferroni-significant. For published scores that do not produce Bonferroni-significant binary annotations, we report results for the top 5% of variants (the average optimized proportion among Bonferroni-significant binary annotations); we made this choice because (in contrast to published Mendelian missense scores) for many other published scores the most significant τ* was not even weakly significant. For boosted pathogenicity scores, we considered the top 10, 5, 1, 0.5 or 0.1% of scored variants, as well as variants with boosted scores ≥0.5; we note that top 10% of SNPs does not necessarily translate to 10% of SNPs, as some SNPs share the same score, and some genomic regions (e.g., MHC) are excluded when running S-LDSC (see below). For boosted scores that produce Bonferroni-significant binary annotations, we report results for the binary annotation with largest ∣τ*∣ among those that are Bonferroni-significant. For boosted scores that do not produce Bonferroni-significant binary annotations, we report results for the top 5% of variants. 
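The reporting rule above (largest ∣τ*∣ among Bonferroni-significant annotations, otherwise a fallback) can be written as a small selection function. This is a sketch under assumptions: the record layout (`threshold`, `tau_star`, `p`) and the function name are hypothetical, and the cutoff uses the P < 0.05/500 correction stated in this section.

```python
def select_binary_annotation(results, fallback=None, alpha=0.05, n_hypotheses=500):
    """Pick which binary annotation to report for one score.

    `results` is a list of dicts with hypothetical keys:
    'threshold' (e.g. 'top 5%'), 'tau_star', and 'p' (p-value of tau_star).
    """
    cutoff = alpha / n_hypotheses  # Bonferroni threshold, 0.05/500 = 0.0001
    significant = [r for r in results if r["p"] < cutoff]
    if significant:
        # Largest |tau*| among Bonferroni-significant annotations.
        return max(significant, key=lambda r: abs(r["tau_star"]))
    if fallback is not None:
        # Fixed fallback threshold (e.g. 'top 5%') when one is prescribed.
        for r in results:
            if r["threshold"] == fallback:
                return r
    # Otherwise report the most significant tau* (used for missense scores).
    return min(results, key=lambda r: r["p"])


# Toy numbers, for illustration only.
toy = [
    {"threshold": "top 10%", "tau_star": 0.3, "p": 1e-6},
    {"threshold": "top 5%", "tau_star": 0.5, "p": 2e-5},
    {"threshold": "top 1%", "tau_star": 0.9, "p": 0.01},
]
chosen = select_binary_annotation(toy)
weak = [dict(r, p=0.02 + i * 0.01) for i, r in enumerate(toy)]
chosen_fallback = select_binary_annotation(weak, fallback="top 5%")
chosen_min_p = select_binary_annotation(weak)
```

In the toy data the top 1% annotation has the largest τ* but is not Bonferroni-significant, so the top 5% annotation is the one reported.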
In all analyses, we excluded binary annotations with a proportion of SNPs <0.02% (the same threshold used in ref. 53 ), because S-LDSC does not perform well for small annotations 25 . We analyzed 155 annotations derived from published scores (31 published scores (Table 2 ), 5 thresholds for top x% of variants, 31 × 5 = 155), such that 500 hypotheses is a conservative correction in the analysis of published scores. We also analyzed 492 annotations derived from boosted scores (82 underlying published scores including 47 baseline-LD model annotations (Table 2 ), 6 thresholds for top x% of variants, 82 × 6 = 492), such that 500 hypotheses is a roughly appropriate correction in the analysis of boosted scores. For simplicity, we corrected for max(155, 492) ≈ 500 hypotheses throughout. We note that, for the meta-analysis τ* p-values, a global FDR < 5% corresponds to P < 0.0305; thus, our choice of P < 0.05/500 = 0.0001 is conservative. In all primary analyses, we analyzed only binary annotations. However, we verified in a secondary analysis of the CDTS score 47 that probabilistic annotations produced results similar to binary annotations (see Supplementary Fig. 13 ).

Heterogeneity of enrichment and τ* across traits

For a given annotation, we assessed the heterogeneity of enrichment and τ* (across 41 independent traits) by estimating the standard deviation of the true parameter value across traits, analogous to ref. 23 . We calculated the cross-trait τ* as the inverse-variance-weighted mean across the traits. Then, we compared \(\sum_{i=1}^{n}(\hat{\tau}_{i}-\hat{\tau}_{\mathrm{across\text{-}trait}})^{2}/\mathrm{std.error}_{i}^{2}\) to a \(\chi_{n}^{2}\) null statistic, where n = 47 (41 independent traits; 47 summary statistics; see Supplementary Table 1 ). We repeated the analysis for heritability enrichment by using enrichments and standard errors of enrichment estimates from S-LDSC.
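As a worked illustration of the heterogeneity statistic above, the following sketch computes the inverse-variance-weighted mean and the sum of squared standardized deviations for a list of per-trait estimates. The function names and toy numbers are hypothetical; turning the statistic into a p-value requires a χ² survival function (for example `scipy.stats.chi2.sf`), which is deliberately left out to keep the sketch dependency-free.

```python
def inverse_variance_weighted_mean(estimates, std_errors):
    """Cross-trait estimate: weight each trait by 1 / std.error^2."""
    weights = [1.0 / se ** 2 for se in std_errors]
    return sum(w * e for w, e in zip(weights, estimates)) / sum(weights)


def heterogeneity_statistic(estimates, std_errors):
    """Sum over traits of (tau_i - tau_across_trait)^2 / std.error_i^2.

    The text compares this statistic to a chi-square null with n degrees of freedom.
    """
    mean = inverse_variance_weighted_mean(estimates, std_errors)
    return sum(((e - mean) / se) ** 2 for e, se in zip(estimates, std_errors))


# Toy per-trait tau* estimates and standard errors (not real data).
tau_hat = [1.0, 3.0]
tau_se = [1.0, 1.0]
tau_mean = inverse_variance_weighted_mean(tau_hat, tau_se)
tau_stat = heterogeneity_statistic(tau_hat, tau_se)
```

The same two functions apply unchanged to heritability enrichment estimates and their standard errors.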
Overlap between gene score quintiles informed by input pathogenicity scores and 165 reference gene sets

For a given pathogenicity score, we scored genes based on the maximum pathogenicity score of linked SNPs, where SNPs were linked to a unique nearest gene using ANNOVAR 84 : 9,997,231 SNP-gene links, decreasing to 5,059,740 S2G links after restricting to 18,117 genes with a protein product (according to HGNC 85 ) that have an Ensembl gene identifier (ENSG ID). Gene scores are reported in Supplementary Data 7 . We constructed quintiles of gene scores and assessed gene-level excess fold overlap with 165 reference gene sets of biological importance (see below; summarized in Supplementary Data 8 ). We note that this analysis used continuous-valued pathogenicity scores, instead of binary annotations.

The 165 reference gene sets (Supplementary Data 8 ) reflected a broad range of gene essentiality 86 metrics, as outlined in ref. 53 . They included known phenotype-specific Mendelian disease genes 19 , constrained genes 38 , 87 , 88 , 89 , essential genes 43 , 44 , 90 , specifically expressed genes across GTEx tissues 78 , dosage outlier genes across GTEx tissues 91 , genes with ClinVar pathogenic or likely pathogenic variants 29 , genes in the Online Mendelian Inheritance in Man (OMIM 92 ), high network connectivity genes in different gene networks 53 , 93 , genes with more independent SNPs 53 , known drug targets 94 , human targets of FDA-approved drugs 95 , eQTL-deficient genes 96 , 97 , and housekeeping genes 98 ; a subset of these gene sets were previously analyzed in ref. 53 .
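The gene-scoring step described above (maximum score over linked SNPs, then quintiles) can be sketched as follows. The function names and input dictionaries are hypothetical, and the equal-count rank-based quintile cut is an assumption; the text does not specify how quintile boundaries or ties were handled.

```python
def gene_scores_from_snps(snp_scores, snp_to_gene):
    """Score each gene by the maximum pathogenicity score of its linked SNPs.

    `snp_to_gene` maps each SNP to its unique nearest gene; SNPs without a
    score are skipped.
    """
    gene_scores = {}
    for snp, gene in snp_to_gene.items():
        if snp not in snp_scores:
            continue
        score = snp_scores[snp]
        if gene not in gene_scores or score > gene_scores[gene]:
            gene_scores[gene] = score
    return gene_scores


def assign_quintiles(gene_scores):
    """Assign quintile 1 (lowest scores) to 5 (highest) by rank (assumed scheme)."""
    ranked = sorted(gene_scores, key=gene_scores.get)
    n = len(ranked)
    return {gene: (i * 5) // n + 1 for i, gene in enumerate(ranked)}


# Toy inputs (hypothetical SNP and gene ids).
toy_snp_scores = {"rs1": 0.2, "rs2": 0.9, "rs3": 0.5}
toy_links = {"rs1": "G1", "rs2": "G1", "rs3": "G2", "rs4": "G3"}
toy_gene_scores = gene_scores_from_snps(toy_snp_scores, toy_links)
toy_quintiles = assign_quintiles({f"G{i}": i / 10 for i in range(1, 11)})
```

Here `G1` takes the larger of its two linked SNP scores, and `G3` receives no score because its only linked SNP is unscored.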
As defined in our previous study 53 , the excess fold overlap of gene set 1 and gene set 2 is defined as follows:

$$\text{excess overlap(gene set 1, gene set 2)}=P_{d}/P_{\mathrm{tot}}$$ (6)

where \(P_{d}=\frac{|\text{gene set 1}\cap\text{gene set 2}|}{|\text{gene set 2}|}\) and \(P_{\mathrm{tot}}=\frac{|\text{gene set 1}\cap\text{all protein-coding genes}|}{|\text{all protein-coding genes}|}\). The standard error for the excess overlap is similarly scaled:

$$\mathrm{SE}=\sqrt{\frac{P_{d}(1-P_{d})}{|\text{gene set 2}|}}/P_{\mathrm{tot}}$$ (7)

When there is excess overlap, the excess fold overlap is >1; when there is depletion, the excess fold overlap is <1. We assessed the odds ratio and the significance of the difference in excess overlap between the boosted gene quintile and the published gene quintile using Fisher’s exact test (two-sided).

Evaluating heritability model fit using \(\mathrm{log}\,l_{SS}\)

Given a heritability model (e.g., the baseline-LD model, combined joint model, or combined marginal model), we define the \(\Delta\mathrm{log}\,l_{SS}\) of that heritability model as the \(\mathrm{log}\,l_{SS}\) of that heritability model minus the \(\mathrm{log}\,l_{SS}\) of a model with no functional annotations (baseline-LD-nofunct; 17 LD and MAF annotations from the baseline-LD model 26 ), where \(\mathrm{log}\,l_{SS}\) 57 is an approximate likelihood metric that has been shown to be consistent with the exact likelihood from restricted maximum likelihood (REML; see Code availability).
We compute p-values for \(\Delta\mathrm{log}\,l_{SS}\) using the asymptotic distribution of the Likelihood Ratio Test (LRT) statistic: \(-2\,\mathrm{log}\,l_{SS}\) follows a \(\chi^{2}\) distribution with degrees of freedom equal to the number of annotations in the focal model, so that \(-2\Delta\mathrm{log}\,l_{SS}\) follows a \(\chi^{2}\) distribution with degrees of freedom equal to the difference in number of annotations between the focal model and the baseline-LD-nofunct model. We used UK10K as the LD reference panel and analyzed 4,631,901 HRC (Haplotype Reference Consortium 99 ) well-imputed SNPs with MAF ≥ 0.01 and INFO ≥ 0.99 in the reference panel. We removed SNPs in the MHC region, SNPs explaining >1% of phenotypic variance, and SNPs in LD with these SNPs. We computed \(\Delta\mathrm{log}\,l_{SS}\) for four heritability models:

baseline-LD: annotations from the baseline-LD model 25 , 26 (86 annotations)
baseline-LD + joint: baseline-LD model + 11 jointly significant annotations (3 published, 8 boosted; 97 annotations)
baseline-LD + marginal-stringent: baseline-LD model + 19 marginally significant annotations with conditional |τ*| > 0.5 (105 annotations)
baseline-LD + marginal: baseline-LD model + 64 marginally significant annotations (11 published, 53 boosted; 150 annotations)

Classification of fine-mapped disease SNPs

We assessed the classification accuracy of fine-mapped disease SNPs. Here, we consider only low-frequency and common SNPs (MAF ≥ 0.5%) and report the total number of unique SNPs (regardless of MAF). We assessed the accuracy of classifying five different SNP sets (summarized in Supplementary Data 22 ): (1) 7333 fine-mapped SNPs for 21 autoimmune diseases from Farh et al. 59 (of 7747 total SNPs; 95% credible sets), (2) 3768 fine-mapped SNPs for inflammatory bowel disease from Huang et al.
60 (of 4311 total SNPs; 95% credible sets), (3) 1851 functionally informed fine-mapped SNPs (of 2225 SNPs, spanning 3025 SNP-trait pairs; stringently defined by causal posterior probability ≥ 0.95) for 49 UK Biobank traits from Weissbrod et al. 61 , (4) 1379 non-functionally informed fine-mapped SNPs (of 1853 total SNPs with causal posterior probability ≥ 0.95) for 49 UK Biobank traits from Weissbrod et al. 61 , and (5) 14,807 SNPs from the NHGRI GWAS catalog 62 , 63 (2019-07-12 version; p-value < 5e−8; we note only about 5% of GWAS SNPs are expected to be causal 59 ). For each of these five SNP sets, we matched 10 control SNPs to each positive fine-mapped SNP by matching LD, MAF, and genomic element, as in the previous studies 9 , 10 , 47 ; we note that these studies emphasized the need for matching the relative genomic region distribution in performance evaluation. MAF was based on the same reference panel (European samples from 1000 Genomes Phase 3 28 ), and LD was estimated by applying S-LDSC to the all-SNPs annotation (‘base’). To identify the genomic element of each SNP and the nearest gene, we annotated these five sets of SNPs with ANNOVAR 84 using the gene-based annotation. For assigning the genomic element to each SNP, we used the default ANNOVAR prioritization rule for gene-based annotations: exonic = splicing (defined by 10bp of a splicing junction) > ncRNA > UTR5 = UTR3 > intronic > upstream = downstream > intergenic. When a SNP (in the intergenic or intronic region) is associated with overlapping genes, the nearest protein-coding gene (based on the distance to the TSS or TSE) is retained. To obtain 10 control SNPs for each positive fine-mapped SNP, we first searched for control SNPs within the same genomic element and on the same chromosome as that positive SNP, then kept the 10 control SNPs with the most similar LD and MAF (based on the average of rank(LD difference from the positive SNP) and rank(MAF difference from the positive SNP)).
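The control-matching procedure described above can be sketched as below. The record fields (`chrom`, `element`, `ld`, `maf`), the function names, and the tie-breaking by input order are hypothetical choices for illustration; they are not taken from the authors' code.

```python
def rank_positions(values):
    """Return 0-based ranks of `values` (smallest first; ties keep input order)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, i in enumerate(order):
        ranks[i] = rank
    return ranks


def match_controls(positive, candidates, n_controls=10):
    """Pick control SNPs for one positive SNP.

    Candidates are restricted to the same chromosome and genomic element,
    then ordered by the average of rank(|LD difference|) and rank(|MAF difference|).
    """
    pool = [c for c in candidates
            if c["chrom"] == positive["chrom"] and c["element"] == positive["element"]]
    ld_rank = rank_positions([abs(c["ld"] - positive["ld"]) for c in pool])
    maf_rank = rank_positions([abs(c["maf"] - positive["maf"]) for c in pool])
    avg_rank = [(l + m) / 2 for l, m in zip(ld_rank, maf_rank)]
    order = sorted(range(len(pool)), key=lambda i: avg_rank[i])
    return [pool[i] for i in order[:n_controls]]


# Toy example (made-up values).
pos = {"id": "p", "chrom": 1, "element": "intronic", "ld": 10.0, "maf": 0.20}
cands = [
    {"id": "a", "chrom": 1, "element": "intronic", "ld": 10.5, "maf": 0.21},
    {"id": "b", "chrom": 1, "element": "intronic", "ld": 30.0, "maf": 0.40},
    {"id": "c", "chrom": 2, "element": "intronic", "ld": 10.0, "maf": 0.20},
    {"id": "d", "chrom": 1, "element": "exonic", "ld": 10.0, "maf": 0.20},
    {"id": "e", "chrom": 1, "element": "intronic", "ld": 12.0, "maf": 0.25},
]
matched_ids = [c["id"] for c in match_controls(pos, cands, n_controls=2)]
```

Candidates `c` and `d` are excluded despite identical LD and MAF because they differ in chromosome or genomic element.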
In secondary analyses, we instead retained a unique most closely matched control SNP. Given positive and control SNP sets, we computed the AUPRC (and AUROC) for each individual score (each of 82 published and 82 boosted scores). We refer to this as a single-score analysis. We used AUPRC as the primary metric, as AUPRC is more robust for imbalanced data 100 . We assessed the significance of the difference between two AUPRCs using standard errors from 1000 bootstrap samples, then performed a two-sample t-test; the variance of AUPRCs (and AUROCs) across the 1000 samples was sufficiently small. We note that this single-score analysis measures an improvement between two scores, where one score is derived from the other (e.g., our boosted score from a published score).

We also performed a multi-score analysis. For each heritability model, we aggregated scores by training a gradient boosting model (features: aggregated scores, positive label: each of five sets of SNPs, control label: LD-, MAF-, and genomic-element-matched control sets of SNPs); we used odd (respectively even) chromosomes as training data to make predictions for even (respectively odd) chromosomes. We used the same training parameters as AnnotBoost (carefully selected to avoid over-fitting, consistent with the previous studies 9 , 10 ) with hyperparameters tuned using a randomized search method with fivefold cross-validation. We report the average AUPRC and AUROC of odd and even chromosome classifiers. We also computed ΔAUPRC as the AUPRC of a given model minus the AUPRC of the baseline-LD-nofunct model. We note that no disease data (five sets of SNPs used as labels) was re-used in these analyses, as AnnotBoost uses only the input pathogenicity scores to generate positive and negative sets of training data. We assessed the significance of the difference as described above.
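The AUPRC comparison described above can be illustrated with a dependency-free sketch. It makes several assumptions that the text does not confirm: AUPRC is estimated as average precision (a step-wise estimator), ties in scores are ordered arbitrarily by the sort, bootstrap resampling is over SNPs with replacement, and the difference is summarized as a standardized statistic whose p-value computation is omitted. All function names are hypothetical.

```python
import random
import statistics


def average_precision(labels, scores):
    """Average precision: mean of precision at the rank of each positive."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("at least one positive label is required")
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank
    return total / n_pos


def bootstrap_se(labels, scores, n_boot=1000, seed=0):
    """Standard error of average precision from `n_boot` bootstrap resamples."""
    if sum(labels) == 0:
        raise ValueError("at least one positive label is required")
    rng = random.Random(seed)
    n = len(labels)
    values = []
    while len(values) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        sample_labels = [labels[i] for i in idx]
        if sum(sample_labels) == 0:
            continue  # resample contained no positives; draw again
        values.append(average_precision(sample_labels, [scores[i] for i in idx]))
    return statistics.stdev(values)


def difference_statistic(auprc_a, se_a, auprc_b, se_b):
    """Standardized difference between two AUPRCs given their standard errors."""
    return (auprc_a - auprc_b) / (se_a ** 2 + se_b ** 2) ** 0.5


# Toy labels (1 = fine-mapped SNP, 0 = matched control) and toy scores.
toy_labels = [1, 0, 1, 0]
toy_ap = average_precision(toy_labels, [0.9, 0.8, 0.7, 0.1])
toy_se = bootstrap_se(toy_labels, [0.9, 0.8, 0.7, 0.1], n_boot=200)
```

In the toy example the positives sit at ranks 1 and 3, so the average precision is (1/1 + 2/3)/2.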
To identify which feature(s) drive the prediction output, we applied SHAP 41 to generate a signed impact of each baseline-LD, published, and boosted score feature on classifying fine-mapped disease SNPs.

Reporting summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.

Data availability

All published and boosted pathogenicity scores, binary annotations, and SHAP results are available at https://alkesgroup.broadinstitute.org/LDSCORE/Kim_annotboost/ . GWAS summary statistics are available at https://alkesgroup.broadinstitute.org/LDSCORE/independent_sumstats/ . The baseline-LD annotations (v.2.1) are available at https://alkesgroup.broadinstitute.org/LDSCORE/ . The 1000 Genomes Project Phase 3 data are available at ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502 . The 165 reference gene sets are available at https://github.com/samskim/networkconnectivity .

Code availability

AnnotBoost source code is provided here: https://github.com/samskim/annotboost/ . This work primarily uses the S-LDSC software ( https://github.com/bulik/ldsc ). SumHer software for computing \(\mathrm{log}\,l_{SS}\) is available at http://dougspeed.com/sumher/ .

References

1. Adzhubei, I. A. et al. A method and server for predicting damaging missense mutations. Nat. Methods 7, 248 (2010). 2. Kircher, M. et al. A general framework for estimating the relative pathogenicity of human genetic variants. Nat. Genet. 46, 310 (2014). 3. Ionita-Laza, I., McCallum, K., Xu, B. & Buxbaum, J. D. A spectral approach integrating functional genomic annotations for coding and noncoding variants. Nat. Genet. 48, 214 (2016). 4. Smedley, D. et al. A whole-genome analysis framework for effective identification of pathogenic regulatory variants in Mendelian disease. Am. J. Hum. Genet. 99, 595–606 (2016). 5. Vaser, R., Adusumalli, S., Leng, S. N., Sikic, M. & Ng, P. C. Sift missense predictions for genomes. Nat. Protoc. 11, 1 (2016). 6.
Ioannidis, N. M. et al. Revel: an ensemble method for predicting the pathogenicity of rare missense variants. Am. J. Hum. Genet. 99, 877–885 (2016). 7. Jagadeesh, K. A. et al. M-cap eliminates a majority of variants of uncertain significance in clinical exomes at high sensitivity. Nat. Genet. 48, 1581 (2016). 8. Sundaram, L. et al. Predicting the clinical impact of human mutation with deep neural networks. Nat. Genet. 50, 1161 (2018). 9. Wells, A. et al. Ranking of non-coding pathogenic variants and putative essential regions of the human genome. Nat. Commun. 10, 1–9 (2019). 10. Caron, B., Luo, Y. & Rausell, A. Ncboost classifies pathogenic non-coding variants in mendelian diseases through supervised learning on purifying selection signals in humans. Genome Biol. 20, 32 (2019). 11. Eilbeck, K., Quinlan, A. & Yandell, M. Settling the score: variant prioritization and mendelian disease. Nat. Rev. Genet. 18, 599 (2017). 12. Peltonen, L., Perola, M., Naukkarinen, J. & Palotie, A. Lessons from studying monogenic disease for common disease. Hum. Mol. Genet. 15, R67–R74 (2006). 13. Blair, D. R. et al. A nondegenerate code of deleterious variants in mendelian loci contributes to complex disease risk. Cell 155, 70–80 (2013). 14. Teslovich, T. M. et al. Biological, clinical and population relevance of 95 loci for blood lipids. Nature 466, 707 (2010). 16. Chong, J. X. et al. The genetic basis of mendelian phenotypes: discoveries, challenges, and opportunities. Am. J. Hum. Genet. 97, 199–215 (2015). 17. Zhu, X., Need, A. C., Petrovski, S. & Goldstein, D. B. One gene, many neuropsychiatric disorders: lessons from mendelian diseases. Nat. Neurosci. 17, 773 (2014). 19. Freund, M. K. et al. Phenotype-specific enrichment of mendelian disorder genes near gwas regions across 62 complex traits. Am. J. Hum. Genet. 103, 535–552 (2018). 20. Zeng, J. et al. Signatures of negative selection in the genetic architecture of human complex traits. Nat. Genet. 50, 746 (2018). 21. 
Zhang, Y., Qi, G., Park, J.-H. & Chatterjee, N. Estimation of complex effect-size distributions using summary-level statistics from genome-wide association studies across 32 complex traits. Nat. Genet. 50, 1318 (2018). 22. Zhu, X. & Stephens, M. Large-scale genome-wide enrichment analyses identify new trait-associated genes and pathways across 31 human phenotypes. Nat. Commun. 9, 4361 (2018). 23. Schoech, A. P. et al. Quantification of frequency-dependent genetic architectures in 25 uk biobank traits reveals action of negative selection. Nat. Commun. 10, 790 (2019). 24. O’Connor, L. J. et al. Extreme polygenicity of complex traits is explained by negative selection. Am. J. Hum. Genet. 105, 456–476 (2019). 25. Finucane, H. K. et al. Partitioning heritability by functional annotation using genome-wide association summary statistics. Nat. Genet. 47, 1228 (2015). 26. Gazal, S. et al. Linkage disequilibrium-dependent architecture of human complex traits shows action of negative selection. Nat. Genet. 49, 1421 (2017). 27. Gazal, S., Marquez-Luna, C., Finucane, H. K. & Price, A. L. Reconciling s-ldsc and ldak functional enrichment estimates. Nat. Genet. 51, 1202–1204 (2019). 1000 Genomes Project Consortium. et al. A global reference for human genetic variation. Nature 526, 68 (2015). 29. Landrum, M. J. et al. Clinvar: public archive of interpretations of clinically relevant variants. Nucleic Acids Res. 44, D862–D868 (2015). 30. Stenson, P. D. et al. The human gene mutation database: towards a comprehensive repository of inherited mutation data for medical research, genetic diagnosis and next-generation sequencing studies. Hum. Genet. 136, 665–677 (2017). 31. Chen, T. &  Guestrin, C. Xgboost: A scalable tree boosting system. In Proc. 22nd acm sigkdd international conference on knowledge discovery and data mining ACM. pp. 785–794 (2016). 32. Hormozdiari, F. et al. 
Leveraging molecular quantitative trait loci to understand the genetic architecture of diseases and complex traits. Nat. Genet. 50, 1041–1047 (2018). 33. Adzhubei, I., Jordan, D. M. & Sunyaev, S. R. Predicting functional effect of human missense mutations using polyphen-2. Curr. Protoc. Hum. Genet. 76, 7–20 (2013). 34. Dong, C. et al. Comparison and integration of deleteriousness prediction methods for nonsynonymous snvs in whole exome sequencing studies. Hum. Mol. Genet. 24, 2125–2137 (2014). 35. Choi, Y., Sims, G. E., Murphy, S., Miller, J. R. & Chan, A. P. Predicting the functional effect of amino acid substitutions and indels. PLoS ONE 7, e46688 (2012). Rights and permissions Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/ .
