The UMD Predictor Pro system
Single nucleotide substitutions (SNPs) represent the majority of human genetic variations, with about 80,000 variants per human exome and more than 3,000,000 variations per human genome. SNPs also account for most human disease-causing mutations, with missense and nonsense mutations representing approximately 56% of them. While classifying nonsense mutations as disease-causing is trivial, classifying missense or synonymous mutations is challenging.
The UMD-Predictor system is an innovative bioinformatics solution to predict the pathogenicity of any SNP from any human transcript. It relies on an original combinatorial approach that consistently outperformed other predictors as illustrated in the figure adapted from Salgado et al. 2016.
UMD Predictor Pro takes the unique combinatorial algorithm of UMD-Predictor and adapts it to the GRCh38 reference genome (Ensembl 108 annotations). The addition of new parameters, such as those derived from DOLPHIN, enriches this algorithm and makes it more efficient for all mutations, including those localized in protein domains.
As illustrated in the following figures, UMD Predictor Pro outperforms all competing products for all missense mutations contained in ClinVar, and even more so for those located in protein domains, which correspond to 40% of the proteins and are where most disease-causing mutations are localized.
Even though some competitors have been trained directly on ClinVar, the untrained combinatorial algorithm of UMD Predictor Pro outperforms them in terms of both accuracy and Matthews correlation coefficient.
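For readers less familiar with these two metrics, the sketch below shows how accuracy and the Matthews correlation coefficient are computed from a binary confusion matrix (pathogenic vs. benign calls). The function name and the counts in the usage lines are illustrative assumptions, not figures from the ClinVar benchmark.

```python
import math


def accuracy_and_mcc(tp: int, tn: int, fp: int, fn: int) -> tuple:
    """Return (accuracy, MCC) for a binary confusion matrix.

    tp / fn: pathogenic variants predicted pathogenic / benign
    tn / fp: benign variants predicted benign / pathogenic
    """
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    accuracy = (tp + tn) / total
    # MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Common convention: MCC is set to 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return accuracy, mcc


# Made-up counts, for illustration only.
balanced = accuracy_and_mcc(tp=90, tn=85, fp=15, fn=10)
# A predictor that calls every variant benign on an imbalanced set.
all_benign = accuracy_and_mcc(tp=0, tn=95, fp=0, fn=5)
print(balanced, all_benign)
```

In the second case accuracy is 0.95 while MCC is 0, which is why the two metrics are usefully reported together: MCC only rewards a predictor that performs well on both pathogenic and benign variants.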
The UMD Predictor Pro algorithm combines the following features:
Blosum62 conservation matrix (global conservation)
Yu’s biochemical substitution matrix (based on 48 physicochemical properties of amino acids)
Protein key residues (for structure or activity)
Predicted impact on splicing signals (splice sites and auxiliary sequences)
Variation frequency at the human population level
Conservation score in 100 species with Grantham’s substitution matrix (protein specific conservation)
DOLPHIN wt and ∆ scores (for mutations localized in protein domains)
Each parameter has a different weight corresponding to its biological consequence. Thus, the alteration of a wild-type donor or acceptor splice site has a major weight, while the creation of a cryptic splice site has a limited weight.
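The idea of giving each parameter its own weight can be sketched as a simple weighted sum. This is only a minimal illustration: the signal names, the weight values and the normalisation are all hypothetical, and the actual UMD Predictor Pro combinatorial algorithm and its weights are not described here.

```python
# Hypothetical weights, chosen only to mirror the ordering stated in the text
# (wild-type splice site alteration: major; cryptic site creation: limited).
# They are NOT the values used by UMD Predictor Pro.
HYPOTHETICAL_WEIGHTS = {
    "wild_type_splice_site_altered": 10.0,
    "key_residue_affected": 6.0,
    "cryptic_splice_site_created": 2.0,
}


def weighted_score(signals: dict, weights: dict = HYPOTHETICAL_WEIGHTS) -> float:
    """Combine per-feature signals (each in [0, 1]) into a score in [0, 1].

    Features missing from `signals` contribute nothing to the score.
    """
    unknown = set(signals) - set(weights)
    if unknown:
        raise KeyError(f"no weight defined for: {sorted(unknown)}")
    for name, value in signals.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"signal {name!r} must be in [0, 1], got {value}")
    total_weight = sum(weights.values())
    return sum(weights[name] * value for name, value in signals.items()) / total_weight


# The same signal strength yields a higher score for the heavier feature.
altered_site = weighted_score({"wild_type_splice_site_altered": 1.0})
cryptic_site = weighted_score({"cryptic_splice_site_created": 1.0})
print(altered_site > cryptic_site)
```

A weighted sum is used here only because it is the simplest way to show unequal contributions; any real combination rule may differ.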
According to the ACMG/AMP guidelines, "most algorithms for missense variant prediction are 65-80% accurate when examining known disease variants. Most tools also tend to have a low specificity". As shown in the figures above, the UMD Predictor Pro system is the only one with an accuracy above 90% and a Matthews Correlation Coefficient (MCC) above 0.8, indicating the best combination of sensitivity and specificity. These statistics bring in silico predictors to a new dimension.
The UMD Predictor Pro system is freely accessible to academics, subject to a daily limit. It can be integrated into any bioinformatics pipeline through web services. Commercial entities can use UMD Predictor Pro under an annual unlimited license.