Improved Prediction Accuracy
The HSF professional system includes several improvements: new PWM matrices for unusual donor and acceptor splice sites; a combination of HSF and MaxEnt predictions to assess the impact of mutations on splicing signals; and a global analysis of auxiliary splicing motifs. As illustrated below, these improve prediction accuracy for splice sites and auxiliary splicing signals, and they complement the system's already very accurate branch point prediction algorithm.
Accuracy evaluation on GT/AG splice sites. To avoid any bias introduced by data selection, we used the BRCA1 and BRCA2 mutations available in the ClinVar reference database. We extracted all annotated mutations affecting donor splice sites (last 3 exonic bases and first 6 intronic bases) and acceptor splice sites (last 12 intronic bases and first 2 exonic bases).
The dataset contains 135 pathogenic and 16 non-pathogenic mutations from the BRCA1 gene, and 88 pathogenic and 15 non-pathogenic mutations from the BRCA2 gene.
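The selection windows described above can be sketched as a small filter. This is only an illustration: the window sizes come from the text, but the signed-offset convention (negative = upstream of the exon/intron junction, positive = downstream, no position 0) and the names `WINDOWS` and `in_splice_window` are assumptions, not part of the HSF system.

```python
# Window bounds taken from the text; the coordinate convention is an assumption.
# Offsets are counted from the exon/intron junction: -1 is the last base
# upstream of the junction, +1 the first base downstream. There is no offset 0.
WINDOWS = {
    "donor": (-3, 6),      # last 3 exonic bases, first 6 intronic bases
    "acceptor": (-12, 2),  # last 12 intronic bases, first 2 exonic bases
}


def in_splice_window(site_type: str, offset: int) -> bool:
    """Return True if a mutation at `offset` falls inside the splice-site window."""
    if offset == 0:
        raise ValueError("offset 0 is undefined in this numbering")
    if site_type not in WINDOWS:
        raise ValueError(f"unknown site type: {site_type!r}")
    low, high = WINDOWS[site_type]
    return low <= offset <= high
```

With this convention, a mutation at the seventh intronic base after a donor junction (`offset=7`) is excluded, while one at the twelfth intronic base before an acceptor junction (`offset=-12`) is included.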
ClinVar BRCA1 & BRCA2 dataset
Matthews Correlation Coefficient
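For readers unfamiliar with the metric, the Matthews Correlation Coefficient can be computed from a confusion matrix as in the minimal sketch below. The function name and the convention of returning 0.0 when the denominator is zero are choices made for this example; no prediction counts from the evaluation are reproduced here.

```python
import math


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews Correlation Coefficient from confusion-matrix counts.

    tp/fn: pathogenic mutations predicted as disrupting / not disrupting.
    tn/fp: non-pathogenic mutations predicted as neutral / disrupting.
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Common convention when a row or column of the matrix is empty.
        return 0.0
    return (tp * tn - fp * fn) / denominator
```

The coefficient ranges from -1 (every prediction wrong) to +1 (every prediction right). It is well suited to a dataset such as this one, where pathogenic mutations (135 + 88 = 223) far outnumber non-pathogenic ones (16 + 15 = 31), because a classifier that labels everything as pathogenic scores 0 rather than a misleadingly high accuracy.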
Accuracy evaluation for rare splice sites. Although most introns belong to the GT/AG U2 category, some use non-canonical donor and acceptor splice sites. Very few mutations have been described at these unusual sites, but the HSF professional system now contains specific matrices for each non-canonical splice site type.
To illustrate the efficiency of this approach, we collected all mutations affecting GC donor sites. As shown below, the HSF v3.0 matrix and the MaxEntScan system identified only 71.4% (20/28) and 55.4% (15/28) of U2-type GC donor splice sites, respectively, and correctly predicted the disruption of these splice sites for 53.6% (15/28) and 39.3% (11/28) of mutations, respectively. In contrast, the new HSF professional U2-type GC donor splice site matrix identified 100% of GC splice sites and predicted either the disruption of these sites or the activation of a nearby cryptic site for 96.4% (27/28) of mutations.
Predictions of the impact of disease-causing mutations localized in U2-type GC donor splice sites. Predictions: B = the donor splice site is identified and the mutation is predicted to disrupt it; U = the donor splice site is not identified by the system; NI = the donor splice site is identified and the mutation does not disrupt it; NI* = the donor splice site is identified and the mutation does not disrupt it but activates a cryptic donor splice site.
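The rates quoted for GC donor sites can be derived from the per-mutation categories defined in this legend. The sketch below shows one way to do the tally; the function name `summarize_calls` and the returned keys are hypothetical, and the label list in the usage note is made up for illustration rather than taken from the evaluation.

```python
from collections import Counter

VALID_CALLS = {"B", "U", "NI", "NI*"}


def summarize_calls(calls: list[str]) -> dict[str, float]:
    """Turn per-mutation calls (B, U, NI, NI*) into summary rates."""
    unknown = set(calls) - VALID_CALLS
    if unknown:
        raise ValueError(f"unexpected call labels: {sorted(unknown)}")
    if not calls:
        raise ValueError("no calls to summarize")
    counts = Counter(calls)
    total = len(calls)
    return {
        # Every category except U means the donor site was identified.
        "site_identified": (total - counts["U"]) / total,
        # B: the site is identified and predicted to be disrupted.
        "disruption_predicted": counts["B"] / total,
        # B or NI*: disruption, or activation of a cryptic donor site.
        "disruption_or_cryptic": (counts["B"] + counts["NI*"]) / total,
    }
```

For example, the illustrative list `["B", "B", "NI*", "NI", "U"]` gives an identification rate of 4/5, a disruption rate of 2/5, and a disruption-or-cryptic rate of 3/5.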