Advanced data science to enhance decision-making in healthcare, based on complex but scarce data.
Our core expertise is data science applied to healthcare. We build predictive models and identify which markers (in a broad sense), alone or in combination, should contribute to these models. These models support predictions in clinical research, biomanufacturing and public health.
In most cases, when large companies such as Google, Amazon, Facebook, Apple, … discuss big data, they refer to their own situation: many observations (millions of users) are available, but each is described by a limited set of features. In this context, statistical learning, i.e. Machine Learning, is a comparatively easy task.
In the healthcare context, there is usually a huge number of parameters (potentially billions), while the number of observations (patients, batches) is generally very limited. That is why at DNAlytics we prefer to talk about Fat Data. It is a very hard context in which to “learn” something with data science methods.
That is why we developed our unique technology: DNAlytics develops and uses special algorithms to cope with this specific healthcare context.
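To make the Fat Data difficulty concrete, here is a minimal sketch in Python. It is a toy illustration with invented numbers, not DNAlytics' algorithms: with more features than observations, two different linear models (the hypothetical `model_a` and `model_b`) can both fit the training data perfectly and still disagree on a new case.

```python
# Toy illustration (all numbers invented): 2 observations, 3 features.

def predict(weights, features):
    """Linear model: weighted sum of the feature values."""
    return sum(w * x for w, x in zip(weights, features))

# Training data: more features (3) than observations (2).
X_train = [[1, 0, 1],
           [0, 1, 1]]
y_train = [1, 2]

# Two different candidate models (hypothetical weights).
model_a = [1, 2, 0]
model_b = [0, 1, 1]

def fits_training_data(weights):
    """True if the model reproduces every training target exactly."""
    return all(predict(weights, x) == y for x, y in zip(X_train, y_train))

# Both models explain the training data perfectly...
both_fit = fits_training_data(model_a) and fits_training_data(model_b)

# ...yet they disagree on an observation they have never seen.
new_observation = [1, 1, 0]
prediction_a = predict(model_a, new_observation)
prediction_b = predict(model_b, new_observation)
print(both_fit, prediction_a, prediction_b)  # True 3 1
```

The training data alone cannot tell the two models apart, which is why this setting generally calls for additional constraints such as feature selection or regularisation.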
Data science is a set of technologies related to several aspects of data. We describe its different facets below.
Artificial Intelligence is a very trendy part of data science. It has no strict definition, but one we like is “AI is a set of techniques allowing a computer to perform tasks usually only achievable by a human”. In a medical context, this would typically mean diagnosing a patient, interpreting medical images, planning a treatment path, predicting response to a treatment… It encompasses many different sets of approaches, such as constraint programming and adversarial search, but also Machine Learning. Most of the field of Bioinformatics is generally associated with AI as well, sometimes arguably.
Machine Learning or ML is a set of techniques allowing a computer system to learn a concept based on a set of examples. In the case of medical diagnosis, this would be a set of patient data (-omics and clinical data), along with a diagnostic label (e.g. diseased or healthy). The ML system is then trained to recognize new diseased or healthy patients which it has never seen before. By doing so, it will also learn which variables, which biomarkers, it has to focus on to make these predictions. Exactly the same technique is used on biomanufacturing data to build a digital twin (making predictions about the yield of a production process) and identify key process drivers (e.g. among raw materials characteristics, bioreactor sensor features, …).
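The idea of learning from labelled examples, and finding out which variables matter, can be sketched in a few lines of Python. This is a deliberately simple nearest-centroid classifier on invented data, assuming exactly two classes; the marker names, values and helper functions are hypothetical and do not represent the models described here.

```python
from statistics import mean

# Hypothetical training examples: marker measurements plus a diagnostic label.
patients = [
    ({"marker_a": 5.1, "marker_b": 1.0, "marker_c": 3.2}, "diseased"),
    ({"marker_a": 4.9, "marker_b": 1.2, "marker_c": 2.9}, "diseased"),
    ({"marker_a": 1.0, "marker_b": 1.1, "marker_c": 3.0}, "healthy"),
    ({"marker_a": 1.2, "marker_b": 0.9, "marker_c": 3.1}, "healthy"),
]

def train(examples):
    """Compute the mean value of each marker per label (one centroid per class)."""
    labels = sorted({label for _, label in examples})
    markers = sorted(examples[0][0])
    return {
        label: {m: mean(f[m] for f, lab in examples if lab == label) for m in markers}
        for label in labels
    }

def classify(centroids, features):
    """Assign the label whose centroid is closest (squared Euclidean distance)."""
    def distance(label):
        return sum((features[m] - c) ** 2 for m, c in centroids[label].items())
    return min(centroids, key=distance)

def rank_markers(centroids):
    """Rank markers by the gap between the two class means (two classes assumed)."""
    first, second = centroids.values()
    return sorted(first, key=lambda m: abs(first[m] - second[m]), reverse=True)

centroids = train(patients)
unseen_patient = {"marker_a": 4.7, "marker_b": 1.0, "marker_c": 3.0}
predicted_label = classify(centroids, unseen_patient)
top_marker = rank_markers(centroids)[0]
print(predicted_label, top_marker)  # diseased marker_a
```

The same two steps, predicting a label for an unseen case and ranking the inputs that drive the prediction, carry over to the biomanufacturing setting, with process features in place of markers and a yield in place of a diagnosis.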
Bioinformatics is a set of computer science techniques used to handle and analyse mostly molecular biology data (DNA, RNA, epigenetics, proteomics, …). These techniques prove very useful in drug development and in some clinical development programs.
For Machine Learning and Bioinformatics applications, we make heavy use of the R programming language, a reference in data science. On top of this open-source layer, we build our own software, some of it open-source, some closed-source. Based on our own software libraries, we then build our customer applications. Once data science results are obtained (new predictive models, biomarkers, decision rules, risk scoring, …), their value increases if they can be made available to and actionable by healthcare professionals, i.e. non-data scientists. That is why we also provide software development capabilities, in order to implement these results.
This technology is effective and recognized, as demonstrated by more than 40 publications by DNAlytics collaborators. Some of our open-source libraries are also downloaded more than 2000 times per month.
When it comes to some specific types of data, such as images or video, a subfield of ML is particularly effective: Deep Learning. Deep Learning is an evolution of the Artificial Neural Networks from the eighties. What makes it different? The networks are much more complex (more “neurons”), but most of all they make heavy use of convolutions as a mathematical tool. Huge progress has also been made because data (images, notably) are much more accessible than decades ago, and in huge amounts. Specific frameworks with a high level of expressiveness have been developed, such as Keras and TensorFlow, which we use as well when appropriate. In such cases, Python is generally the language of choice (although R can achieve about the same).
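As a minimal sketch of the convolution operation mentioned above, the following plain Python slides a small kernel over a toy image. The function `convolve2d`, the image and the kernel are all illustrative assumptions. As in most Deep Learning frameworks, the kernel is not flipped, so strictly speaking this is a cross-correlation, and a real network would learn the kernel values rather than have them written by hand.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image ("valid" mode: no padding, no kernel flip)."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    return [
        [
            sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(k_rows)
                for j in range(k_cols)
            )
            for c in range(out_cols)
        ]
        for r in range(out_rows)
    ]

# Toy 4x4 "image": dark left half (0), bright right half (1).
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]

# Hand-written 2x2 kernel that responds to a dark-to-bright vertical edge.
edge_kernel = [
    [-1, 1],
    [-1, 1],
]

feature_map = convolve2d(image, edge_kernel)
print(feature_map)  # [[0, 2, 0], [0, 2, 0], [0, 2, 0]]
```

The output is non-zero only where the kernel straddles the edge, which illustrates how a convolution turns raw pixels into a map of local patterns.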
To be able to analyse data and apply AI approaches, data must be available in a clean state. A large part of data science projects consists of retrieving, formatting, translating and cleaning data. To do that, we would typically use Bioconductor for bioinformatics data, SimpleITK for imaging data, OpenClinica in clinical trial settings, or custom developments in R or Python, just to name a few.
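A custom cleaning step in Python might look like the sketch below, which uses only the standard library. The column names, the sample export and the cleaning rules (dropping incomplete and duplicate records) are assumptions made for illustration; in a real project these rules depend on the data and on the study.

```python
import csv
import io

# Hypothetical raw export: inconsistent labels, stray whitespace, a missing value
# and a duplicated record.
raw_export = """patient_id,diagnosis,marker_a
P001, Diseased ,5.1
P002,healthy,
P003,HEALTHY,1.2
P001, Diseased ,5.1
"""

def clean(rows):
    """Normalise labels, parse numbers, drop incomplete and duplicate records."""
    seen = set()
    cleaned = []
    for row in rows:
        patient_id = row["patient_id"].strip()
        diagnosis = row["diagnosis"].strip().lower()
        marker = row["marker_a"].strip()
        if not marker or patient_id in seen:
            continue  # skip records with a missing value or a repeated ID
        seen.add(patient_id)
        cleaned.append(
            {"patient_id": patient_id, "diagnosis": diagnosis, "marker_a": float(marker)}
        )
    return cleaned

records = clean(csv.DictReader(io.StringIO(raw_export)))
print(records)
```

Here two of the four raw rows survive: P001 (with its label normalised to "diseased") and P003. P002 is dropped for its missing measurement, and the second P001 row is dropped as a duplicate.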
And of course, first of all, data must be accessible. They generally come from our customers, but we can usually complement them with other sources: we take advantage of publicly available data and combine heterogeneous data sources.
To obtain the computing power we need, we make heavy use of cloud computing solutions, such as Amazon Web Services (AWS). We are an AWS certified Partner.
These infrastructures allow us to take advantage of the most recent computing technologies, including GPUs (among others, those from NVIDIA), an alternative to more classical CPUs. GPUs and CPUs each have specific characteristics that make them the best choice for different kinds of mathematical operations. Deep Learning in particular makes heavy use of GPUs.
Deploying complete data science applications so that they are available in practice also requires mastering several IT frameworks, such as Docker, R Shiny or Conda / Anaconda.
For more information about our own software libraries, go to the dedicated page.
See a non-exhaustive list of publications (scientific communications, patents, software libraries) to which DNAlytics collaborators have contributed.