ICM - Brain tumour DATA analysis

One of the biggest challenges in the age of Machine Learning is to effectively impact the healthcare domain, bringing benefits to patient treatments and providing a toolset that doctors can use to aid their understanding in situations where AI can provide an advantage.


Statistics and medicine have a joint history spanning more than a century. Models started being successfully applied to medical datasets as far back as the late 1960s. Machine Learning methods have been applied to clinical and genetic datasets from the 1980s onwards, with a pickup in pace over the past couple of years that mirrors other fields. This has been driven by progress in modelling and computational capabilities in the Machine Learning field, by the explosion in medical data, clinical and unstructured, and by the lower cost of sequencing genetic data [1, 2, 3].

One of the previous research projects done by our co-founders involved partnering with the neuro-oncology department of the Pitie Salpetriere University Hospital/Pierre et Marie Curie University in Paris. Over several decades, this department has collected one of the largest, and certainly one of the most unusual, clinical datasets on brain tumors.

In 2018, brain tumors remain one of the worst types of cancer. They are difficult to treat surgically, as well as to reach with radiotherapy or chemotherapy, and survival expectancy is bleak. For glioblastoma (GBM), survival expectancy from diagnosis is 3 months without treatment and 14 months with treatment (Gallego, 2015; WHO, 2016). The vast majority of studies applying Machine Learning to medical datasets focus on the most common cancers, particularly breast cancer. The aim of our study was to give insight into the patterns found in this dataset for glioblastoma, with three goals:

  1. Apply supervised learning to build survival prediction models and compare them to survival analysis methods.

  2. Perform unsupervised learning for patient clustering & data exploration.

  3. Develop data visualizations and tools that give insight into data patterns and can be of use to clinicians.
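The post does not name the survival analysis methods used as a baseline for the first goal. Purely as an illustration, the Kaplan-Meier estimator is one standard method of that family. The sketch below assumes each patient has a follow-up time and a flag saying whether death was observed or the record was censored; the function name `kaplan_meier` and the toy data are hypothetical and not taken from the study.

```python
from collections import Counter


def kaplan_meier(durations, event_observed):
    """Return [(time, survival_probability)] at each time with at least one death.

    durations: follow-up time per patient (e.g. months since diagnosis)
    event_observed: True if death was recorded, False if the record is censored
    """
    deaths = Counter(t for t, e in zip(durations, event_observed) if e)
    exits = Counter(durations)  # every patient leaves the risk set at their own time
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exits):
        d = deaths.get(t, 0)
        if d:
            # Multiply by the fraction of at-risk patients who survive time t
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= exits[t]  # deaths and censored patients both drop out
    return curve


# Hypothetical toy data, not from the study
months = [3, 5, 5, 8, 12, 14, 14, 20]
died = [True, True, False, True, False, True, True, False]
for t, s in kaplan_meier(months, died):
    print(f"month {t:>2}: S(t) = {s:.3f}")
```

Censored patients still count towards the at-risk total until their last follow-up, which is what lets this kind of method use records without a known death date.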


There has been significant progress in recent years on applications of Machine Learning methods to the diagnosis of brain cancer, mostly focused on medical imagery (Dr. Bradley Erikson, NVIDIA’s 2017 Global Impact Award; V. Panca and Z. Rustam). The recent work of J Lao et al., published in Nature, is a good example of applying Deep Learning to MRI data (~75 observations) for prediction of survival. However, work on applying Machine Learning to brain cancer survival prediction and clustering from clinical data is almost nonexistent. We came across one such study, BS MA et al., which mixed molecular and clinical data for GBM cases from the Cancer Genome Atlas database in order to predict survival. The study achieved Area Under the Curve (AUC) scores of 0.82 when mixing clinical data with somatic copy-number alteration data, and of 0.98 when mixing microRNA data with clinical data. There have been several meta-studies of applications of Machine Learning to oncological datasets, notably Cruz & Wishart (2007), Kourou et al. (2015) and, focused on breast cancer studies specifically, PH Abreu et al. (2016). Cruz & Wishart found more than 1500 papers on the topic of Machine Learning and cancer.


Our dataset consisted of a series of subjects diagnosed with brain tumors over several decades. It contained 7630 observations, including, in some cases, several observations taken over time for the same subject. The three main components were categorical variables such as gene mutations, tumor type, grade, location and surgery type; binary variables such as gender or whether a patient underwent a given treatment, i.e. radiotherapy or chemotherapy; and continuous variables such as age at surgery and life expectancy.

As with any real-world dataset, ours had a lot of missing data, with several genetic indicators missing for between 50% and 83% of the observations. Furthermore, only about 30% of observations had a known death date. Preprocessing the dataset and dealing with the missing data was one of the most challenging aspects of our study. In the end, we chose two methods of handling missing data, both by imputation: MICE and Amelia. Because this is such a pervasive problem, we believed the best approach for sharing our knowledge was to write a hands-on tutorial on how to deal with missing data, recently presented by our collaborator Alex at ODSC Europe.
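To give a feel for the chained-equations idea behind MICE, here is a heavily simplified sketch. It is not the MICE or Amelia implementation: it handles only two numeric columns, uses plain least-squares regression, and leaves out the random draws and the multiple imputed datasets that make the real method "multiple" imputation. The function names and the toy columns `age` and `marker` are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        return my, 0.0  # constant predictor: fall back to the mean
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b


def chained_imputation(col_a, col_b, n_iter=10):
    """Fill None entries in two numeric columns by alternating regressions."""
    cols = [list(col_a), list(col_b)]
    missing = [[v is None for v in c] for c in cols]

    # Step 1: start from simple mean imputation
    for c, m in zip(cols, missing):
        fill = mean([v for v in c if v is not None])
        for i, is_missing in enumerate(m):
            if is_missing:
                c[i] = fill

    # Step 2: cycle through the columns, re-predicting each missing cell
    # from the other column's current (observed or imputed) values
    for _ in range(n_iter):
        for target in (0, 1):
            predictor = 1 - target
            observed = [i for i, is_m in enumerate(missing[target]) if not is_m]
            a, b = fit_line([cols[predictor][i] for i in observed],
                            [cols[target][i] for i in observed])
            for i, is_m in enumerate(missing[target]):
                if is_m:
                    cols[target][i] = a + b * cols[predictor][i]
    return cols


# Hypothetical toy data in which marker = 2 * age + 1 for the complete rows
age = [30, 40, None, 60, 70]
marker = [61, None, 101, 121, 141]
filled_age, filled_marker = chained_imputation(age, marker)
print([round(v, 1) for v in filled_age])
print([round(v, 1) for v in filled_marker])
```

On this toy data the imputed cells should settle close to the values implied by the linear relationship. Real clinical data is far noisier, which is why the full methods draw imputations from a predictive distribution and produce several completed datasets rather than a single one.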


We present here a quick summary of some of the methods we used in our analysis. At the time of writing, we are in the process of publishing our work, which will contain the detailed steps for the entire workflow.


For hierarchical clustering we used methods based on PCAMIX (Kiers 1991) that aim to maximize a homogeneity criterion for each cluster of variables.


Another useful algorithm for clustering is COBWEB3 (above), which performs incremental concept formation. For partition-based clustering we used Partitioning Around Medoids (PAM) on Gower distances, which are better suited to measuring similarity between observations described by mixed-type variables.
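As a rough illustration of how PAM and Gower distances fit together, the sketch below computes a Gower distance matrix over mixed numeric and categorical records and then runs a greedy swap-based medoid search. It is a minimal sketch under stated assumptions: the initial medoids are simply the first k records (PAM proper uses a BUILD step), and the function names, feature layout and patient records are hypothetical.

```python
def gower_distance(a, b, numeric_ranges):
    """Average per-feature dissimilarity between two mixed-type records.

    numeric_ranges maps the index of each numeric feature to its range
    (max - min over the dataset); all other features are treated as categorical.
    """
    total = 0.0
    for j, (x, y) in enumerate(zip(a, b)):
        if j in numeric_ranges:
            r = numeric_ranges[j]
            total += abs(x - y) / r if r else 0.0
        else:
            total += 0.0 if x == y else 1.0
    return total / len(a)


def pam(dist, k):
    """Greedy swap-based k-medoids on a precomputed distance matrix."""
    n = len(dist)
    medoids = list(range(k))  # naive start, an assumption of this sketch

    def cost(meds):
        # Total distance from every record to its nearest medoid
        return sum(min(dist[i][m] for m in meds) for i in range(n))

    best = cost(medoids)
    improved = True
    while improved:
        improved = False
        for pos in range(k):
            for candidate in range(n):
                if candidate in medoids:
                    continue
                trial = medoids[:pos] + [candidate] + medoids[pos + 1:]
                c = cost(trial)
                if c < best - 1e-12:  # accept only strict improvements
                    medoids, best, improved = trial, c, True
    labels = [min(range(k), key=lambda c: dist[i][medoids[c]]) for i in range(n)]
    return medoids, labels


# Hypothetical records: (age at surgery, tumor grade, had radiotherapy)
patients = [
    (34, "II", True), (38, "II", True), (41, "II", False),
    (66, "IV", True), (71, "IV", False), (69, "IV", True),
]
ages = [p[0] for p in patients]
numeric_ranges = {0: max(ages) - min(ages)}
dist = [[gower_distance(a, b, numeric_ranges) for b in patients] for a in patients]
medoids, labels = pam(dist, k=2)
print("medoid rows:", medoids)
print("cluster labels:", labels)
```

Because medoids are actual records rather than averaged centroids, each cluster can be summarized by a real patient profile, and the algorithm only needs a distance matrix, which is what makes it compatible with Gower distances on mixed data.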

Supervised Classification

The purpose of this analysis was to determine whether we can classify patients based on their date of death, how well the predictions hold up on a test set of patients, and how useful that is for clinicians. We used out-of-the-box classifiers such as Decision Trees, Random Forests, and several variants of Neural Networks to try to capture the nonlinearity in the dataset. We are also building a tool that allows clinicians to inspect where the algorithms got “fooled”. Below are confusion matrices from Random Forests & Neural Networks on a test set of 418 patients.

Neural networks provide marginally better results than Random Forests. The downside of this approach is that it is a black box method and we cannot make direct statements about the importance of each feature in our dataset.
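For readers less familiar with confusion matrices, the following sketch shows how one is built from true and predicted labels, and how accuracy is read off its diagonal. The survival classes and the label lists are hypothetical, chosen only for illustration; they are not the classes or results from the study.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Rows are true classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def accuracy(matrix):
    """Fraction of predictions on the diagonal, i.e. correctly classified."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Hypothetical survival classes and predictions
classes = ["<1y", "1-2y", ">2y"]
y_true = ["<1y", "<1y", "1-2y", "1-2y", ">2y", ">2y", "<1y", "1-2y"]
y_pred = ["<1y", "1-2y", "1-2y", "1-2y", ">2y", "1-2y", "<1y", ">2y"]

cm = confusion_matrix(y_true, y_pred, classes)
for label, row in zip(classes, cm):
    print(f"{label:>5}: {row}")
print("accuracy:", accuracy(cm))
```

The off-diagonal cells show which classes a model confuses with each other, which is the kind of information a clinician would need in order to inspect where an algorithm got "fooled".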


We believe that stand-alone clinical data does not have enough predictive power to allow for high accuracy in a classification or regression setting. Even so, a study developing supervised and unsupervised Machine Learning methods on clinical data can give clinicians another view of how patients are related to one another. Our goal is to provide clinicians with a simple software tool where they can interpret the results of the various methods presented.


1. MIT Review
2. Watson Health
3. Deepmind Health