A methodology for sorting genetic data

Published on 12.08.2021

Over the last decade, we have witnessed a proliferation of genomic data, due to technological advances in large-scale analysis. This data represents a potential source of progress for diagnosis and personalized treatment of patients. The challenge is to analyze and interpret this mass of information. With the support of MSDAVENIR and the French National Research Agency, Antonio Rausell and his research team in Clinical Bioinformatics are proposing a technique to sort through this data using machine learning.  

Artificial Intelligence

Antonio Rausell, Inserm researcher and head of the Clinical Bioinformatics Laboratory at Institut Imagine, has set himself the task of better understanding genetic variants, the small genetic differences that make each of us unique. These modifications, which appear spontaneously in our genome, can have consequences at the molecular and cellular levels and for the individual as a whole, and can even cause disease. One of the difficulties lies in their interpretation, as most of them have no visible effect. The genetic variants found in an individual are numerous and varied: they can be found in the coding regions - those that contain the information necessary for the production of proteins, the linchpins of the cells - or in the non-coding regions, which constitute some 98% of our genome. Long unexplored and once considered useless, non-coding regions are increasingly recognized as important. "It's like looking for a needle in a haystack," says Barthélémy Caron, a doctoral student in Antonio Rausell's team and first author of the study. "With this long-term work, we hope to discover the causes of certain genetic diseases." To date, for half of the 4000 rare genetic diseases identified, the causal genes or variants have not been characterized.

Exploring the dark matter of the genome

Millions of different variants - mostly benign - have already been identified. Today, the tools available give access to a large amount of data in both coding and non-coding regions. "Determining which variants, particularly in non-coding regions, may be the cause of a pathology is a real challenge for doctors and researchers," explains the researcher.

Thanks to the support of MSDAVENIR through the DEVO-DECODE project, and of the French National Research Agency through the Investments for the Future program and the C'IL-LICO project, Antonio Rausell's team has developed a methodology to identify the most influential variants in a person's non-coding genome.

"Using machine learning and the already available dataset, our method performs an initial sort. We prioritize variants based on their potential pathogenic impact. Then, doctors and experimental teams can focus on these variants to validate whether they are really the cause of diseases," explains the researcher.
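The "initial sort" described above can be illustrated with a minimal Python sketch. It is not the team's software: the variant identifiers, the scores and the `prioritize` function are invented for illustration, and the sketch assumes that a trained model has already assigned each variant a pathogenicity score between 0 and 1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoredVariant:
    variant_id: str  # hypothetical identifier, not a real variant
    score: float     # assumed pathogenicity score in [0, 1] from some trained model


def prioritize(variants, top_n):
    """Return the top_n variants, highest assumed pathogenicity score first."""
    ranked = sorted(variants, key=lambda v: v.score, reverse=True)
    return ranked[:top_n]


# Made-up example data: four variants with illustrative scores.
variants = [
    ScoredVariant("variant_A", 0.12),
    ScoredVariant("variant_B", 0.91),
    ScoredVariant("variant_C", 0.47),
    ScoredVariant("variant_D", 0.78),
]

# Shortlist handed over for clinical and experimental validation.
shortlist = prioritize(variants, top_n=2)
print([v.variant_id for v in shortlist])
```

Here the ranking only narrows the list; whether a shortlisted variant actually causes a disease remains a question for doctors and experimental teams.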

With this dive into the unknown areas of genetics, Antonio Rausell's team hopes to bring a new perspective to the study of genetic diseases, and especially to identify their molecular origin. "It is crucial for patients and their families to be able to name the disease and know its cause," explains Antonio Rausell. Discovering the origin of diseases is usually a first step towards a therapeutic approach.

The software is now open source and the results are available online. The methodology is accessible within Imagine's bioinformatics platform and is also available to the entire scientific community. "Our approach perfectly illustrates the Imagine spirit," concludes Antonio Rausell. "Our method was immediately made available to other researchers and physicians so that it could benefit patients and fulfill one of the Institute's primary missions, namely to better characterize genetic diseases and treat them."

Resources

Publication

NCBoost classifies pathogenic non-coding variants in Mendelian diseases through supervised learning on purifying selection signals in humans.
