Improved Prediction Accuracy
Single nucleotide substitutions (SNPs) account for most human genetic variation, with about 80,000 variants per human exome and more than 3,000,000 per human genome. They also account for most human disease-causing mutations, with missense and nonsense mutations representing approximately 56% of them. While classifying nonsense mutations as disease-causing is trivial, classifying missense or synonymous mutations is challenging.
The UMD-Predictor system is an innovative bioinformatics solution to predict the pathogenicity of any SNP from any human transcript. It relies on an original combinatorial approach that consistently outperformed other predictors as illustrated in the figure adapted from Salgado et al. 2016.
UMD Predictor Pro takes the unique combinatorial algorithm of UMD Predictor and adapts it to the GRCh38 reference genome (Ensembl 108 annotations). The addition of new parameters, such as those derived from DOLPHIN, enriches this algorithm and makes it more efficient for all mutations, including those localized in protein domains.
As illustrated in the following figures, UMD Predictor Pro outperforms all competing products on the missense mutations contained in ClinVar, and by an even wider margin on those located in protein domains, which correspond to 40% of the proteins and where most disease-causing mutations are localized.
Even though some competitors have been trained directly on ClinVar, the untrained combinatorial algorithm of UMD Predictor Pro outperforms them in both accuracy and Matthews correlation coefficient.
The UMD Predictor Pro algorithm combines the following features:
Blosum62 conservation matrix (global conservation)
Yu’s biochemical substitution matrix (based on 48 physicochemical properties of amino acids)
Protein key residues (for structure or activity)
Predicted impact on splicing signals (splice sites and auxiliary sequences)
Variation frequency at the human population level
Conservation score in 100 species with Grantham’s substitution matrix (protein specific conservation)
For mutations localized in protein domains, the use of the DOLPHIN wt and ∆ scores
Each parameter is weighted according to its biological consequence. Thus, the alteration of a wild-type donor or acceptor splice site carries a major weight, while the creation of a cryptic splice site carries a limited one.
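The weighting idea can be illustrated with a minimal sketch. It assumes each feature has already been reduced to a score between 0 and 1 and that the scores are combined as a normalized weighted sum. The feature names, the weight values and the combination rule are all hypothetical, chosen only to show a heavily weighted parameter dominating a lightly weighted one. They are not the actual UMD Predictor Pro parameters or algorithm.

```python
from typing import Mapping

# Hypothetical weights for illustration only; the real UMD Predictor Pro
# weights are not given in this text.
ILLUSTRATIVE_WEIGHTS = {
    "blosum62": 1.0,
    "biochemical_substitution": 1.0,
    "key_residue": 2.0,
    "wild_type_splice_site_disruption": 5.0,  # "major weight"
    "cryptic_splice_site_creation": 0.5,      # "limited" weight
    "population_frequency": 1.5,
    "species_conservation": 1.5,
}


def combined_score(
    features: Mapping[str, float],
    weights: Mapping[str, float] = ILLUSTRATIVE_WEIGHTS,
) -> float:
    """Return a weighted sum of feature scores, normalized to [0, 1].

    Each feature score is expected in [0, 1]; features that are not
    supplied contribute nothing to the total.
    """
    unknown = set(features) - set(weights)
    if unknown:
        raise KeyError(f"no weight defined for: {sorted(unknown)}")
    total_weight = sum(weights.values())
    weighted_sum = sum(weights[name] * value for name, value in features.items())
    return weighted_sum / total_weight


# Same raw score, very different contribution to the final result.
disrupted_site = combined_score({"wild_type_splice_site_disruption": 1.0})
cryptic_site = combined_score({"cryptic_splice_site_creation": 1.0})
print(disrupted_site, cryptic_site)
```

With these illustrative weights, a fully disrupted wild-type splice site contributes ten times more to the combined score than the creation of a cryptic site with the same raw score.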
Accuracy evaluation for rare splice sites
While most introns belong to the GT/AG U2 category, some use non-canonical donor and acceptor splice sites. Very few mutations have been described at these unusual sites, but the HSF professional system now contains specific matrices for each non-canonical splice site type.
To illustrate the efficiency of this approach, we collected all mutations affecting GC donor sites. As shown below, the HSF v3.0 matrix and the MaxEntScan system identified only 71.4% (20/28) and 55.4% (15/28) of U2-type GC donor splice sites, respectively, and correctly predicted the disruption of these splice sites for 53.6% (15/28) and 39.3% (11/28) of mutations, respectively. In contrast, the new HSF professional U2-type GC donor splice site matrix identified 100% of GC splice sites and predicted the disruption of these sites, or the activation of a nearby cryptic site, for 96.4% (27/28) of mutations.
According to the ACMG/AMP guidelines, "most algorithms for missense variant prediction are 65-80% accurate when examining known disease variants. Most tools also tend to have a low specificity". As shown above, the UMD Predictor Pro system is the only one with an accuracy above 90% and a Matthews Correlation Coefficient (MCC) above 0.8, indicating the best combination of sensitivity and specificity. These figures take in silico predictors to a new level of performance.
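For readers less familiar with these two metrics, the sketch below computes accuracy and MCC from a confusion matrix using their standard definitions. The counts in the example are invented for illustration and are not taken from any benchmark. They show why MCC is reported alongside accuracy: a predictor that calls almost every variant pathogenic can reach a high accuracy on an imbalanced dataset while its MCC stays low, because its specificity is poor.

```python
import math


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of variants classified correctly."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("empty confusion matrix")
    return (tp + tn) / total


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient, ranging from -1 to +1.

    Returns 0.0 when a row or column of the confusion matrix is empty,
    which is the usual convention for the undefined case.
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Invented counts: 90 pathogenic and 10 benign variants, with a predictor
# that labels nearly everything as pathogenic.
counts = dict(tp=90, fn=0, fp=9, tn=1)
print(f"accuracy = {accuracy(**counts):.2f}")
print(f"MCC      = {matthews_corrcoef(**counts):.2f}")
```

A high MCC requires both false positives and false negatives to be rare, which is why it is a stricter summary than accuracy alone.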
The UMD Predictor Pro system is freely accessible to academics, subject to a daily limit. It can be integrated into any bioinformatics pipeline through web services. Commercial entities can use UMD Predictor Pro under an annual unlimited license.