The Human Splicing Finder system
The splicing machinery is one of the most complex ribonucleoprotein assemblies in the cell. It matures pre-messenger RNA by excising introns. This process is directed by various signals contained in the RNA molecules, and thus in the DNA sequence itself.
The HSF system was designed to identify all splicing signals, including acceptor and donor splice sites, branch points and auxiliary splicing signals (exonic splicing enhancers, ESE, and exonic splicing silencers, ESS).
Because some of these signals (ESE and ESS) are still poorly characterized, HSF combines multiple algorithms and matrices into a one-stop shop for splicing signals, giving users a wide range of information about the signals contained in any human genome sequence. In addition, because many disease-causing mutations are now known to affect these signals, HSF can efficiently predict the impact of any mutation (SNP, insertion or deletion) on them.
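To make the idea of matrix-based prediction concrete, here is a minimal sketch of how a mutation's effect on a motif can be estimated with a generic position weight matrix. It is not HSF's actual algorithm or one of its matrices: TOY_MATRIX, window_scores and substitution_effect are hypothetical names, and the scores are invented purely for illustration.

```python
# Hypothetical 3-position weight matrix (invented scores, NOT an HSF matrix).
# Each column maps a nucleotide to its score at that motif position.
TOY_MATRIX = [
    {"A": -1, "C": -1, "G": 2, "T": -1},
    {"A": -1, "C": -1, "G": -1, "T": 2},
    {"A": 1, "C": -1, "G": 1, "T": -1},
]


def window_scores(seq, matrix):
    """Score every window of len(matrix) nucleotides along seq."""
    width = len(matrix)
    return [
        sum(col[base] for col, base in zip(matrix, seq[i:i + width]))
        for i in range(len(seq) - width + 1)
    ]


def substitution_effect(seq, pos, alt, matrix):
    """Compare the best motif score before and after a substitution.

    pos is a 0-based index into seq; alt is the replacement nucleotide.
    Returns (wild_type_best, mutant_best, difference).
    """
    mutant = seq[:pos] + alt + seq[pos + 1:]
    wt_best = max(window_scores(seq, matrix))
    mut_best = max(window_scores(mutant, matrix))
    return wt_best, mut_best, mut_best - wt_best


if __name__ == "__main__":
    # T>C substitution at index 4 of a toy sequence.
    print(substitution_effect("CAGGTAAGT", 4, "C", TOY_MATRIX))
```

A negative difference means the mutant sequence matches the motif less well than the wild type. A real predictor would add thresholds and would combine several matrices and algorithms, as described above.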
Because "the perfect is the enemy of the good", this wealth of information can be difficult for some non-specialist users to handle. To solve this issue, the HSF system now contains an expert system that digests all predictions and provides users with an easy-to-understand pathogenicity prediction for any intronic or exonic mutation potentially affecting splicing.
HSF is the reference system for splicing analysis, as illustrated by thousands of citations in the literature.
As illustrated in the picture above, the MaxEntScan and HSF algorithms perform best in all situations (mutations affecting canonical splice sites, deep intronic mutations activating cryptic splice sites, and exonic mutations, with positive (green) and negative (red) datasets). Although the MaxEntScan algorithm displays a high false positive rate in some situations, it has been combined with the HSF matrices in the new expert system.
In addition, the new expert system has been improved to predict the impact of mutations on ESE/ESS and now reaches a high level of accuracy for these still difficult-to-predict situations.
In addition to previous evaluations, we performed a new evaluation of HSF accuracy using data from BRCA1/2 mutations extracted from reference databases (ARUP, ClinVar and BRCA-Share™).
As illustrated, the accuracy, Matthews correlation coefficient (MCC) and log diagnostic odds ratio (log(DOR)) are reproducibly very high for mutations affecting donor or acceptor splice sites.
For mutations affecting auxiliary splicing sequences, using a dataset of 50 mutations experimentally shown either to affect or not to affect ESE/ESS, the new HSF algorithm achieved a sensitivity of 0.83, a specificity of 0.81 and an accuracy of 0.82, an unprecedented level of accuracy (0.48 for ESEFinder on the same dataset).
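For readers unfamiliar with these measures, the sketch below shows how sensitivity, specificity, accuracy, MCC and log(DOR) are derived from a confusion matrix. The counts are hypothetical: the evaluation's actual breakdown is not given here, and 20/4/21/5 is simply one split of 50 samples that rounds to the figures quoted above. The function name classification_metrics is likewise illustrative, and the natural logarithm is assumed for log(DOR).

```python
import math


def classification_metrics(tp, fn, tn, fp):
    """Standard binary-classification metrics from confusion-matrix counts."""
    total = tp + fn + tn + fp
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    accuracy = (tp + tn) / total

    # Matthews correlation coefficient; defined as 0 when a margin is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    # Diagnostic odds ratio (TP*TN)/(FP*FN); its log is undefined if any
    # cell is zero, so None is returned in that case.
    # Assumption: natural logarithm.
    if tp and tn and fp and fn:
        log_dor = math.log((tp * tn) / (fp * fn))
    else:
        log_dor = None

    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "mcc": mcc,
        "log_dor": log_dor,
    }


if __name__ == "__main__":
    # Hypothetical counts (not taken from the HSF evaluation).
    metrics = classification_metrics(tp=20, fn=4, tn=21, fp=5)
    for name, value in metrics.items():
        print(f"{name}: {value:.2f}")
```

Unlike accuracy, MCC and log(DOR) take all four cells of the confusion matrix into account, which makes them more informative when the positive and negative datasets differ in size.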
The HSF system is accessible to regular users through a software-as-a-service license. It can be integrated into any bioinformatics pipeline through web services. Bioinformatics companies wishing to include HSF in their systems have various options for doing so and should contact us.