Comments by Dr. T. Helleputte, Dr. J. Paul, P. Gramme, Dr. D. Bertrand, M. Bastin, C. Tits and Prof. P. Dupont about the paper “A systematic review shows no performance benefit of machine learning over logistic regression for clinical prediction models” written by Christodoulou et al. in the Journal of Clinical Epidemiology 110 (2019) 12-22.
DNAlytics, Louvain-la-Neuve, Belgium
UCLouvain, EPL, ICTEAM, Louvain-la-Neuve, Belgium


1. Context

A paper entitled “A systematic review shows no performance benefit of machine learning over logistic regression for clinical prediction models” was published earlier this year (2019). It has been widely shared and commented upon on various online platforms.

Two types of comments were frequently seen, which we can summarize as follows:

  1. “This article proves machine learning is not interesting”
  2. “This article shows that the methodology used to analyse data is irrelevant”

The title of our article is a pun on the title of the original paper. Our team of data science professionals has discussed the original paper and the related comments, and wishes to share some constructive thoughts.

2. About the paper: A provocative title leading to a misleading interpretation

The paper clearly has a provocative title. The way it is phrased is not neutral: it is in disfavor of “machine learning” as opposed to “logistic regression”. Based on the results of the paper, the authors could just as well have written the title we propose above, with the two concepts switched, which would have suggested the opposite (and similarly misleading…) interpretation, in disfavor of logistic regression as opposed to machine learning methods. Beyond the authors’ choice of positioning, two specific points are worth discussing.

First, a strong dichotomy between logistic regression on one side and machine learning methods on the other makes little sense. A regularized logistic regression (as considered by the authors) is almost identical to a linear support vector machine (read more on this in the technical appendix at the bottom of this page). Even a non-regularized logistic regression (call it “non-penalized” if you prefer) would produce models similar to an SVM, provided that a sufficiently large number of samples is available relative to the number of measured input features (call them “independent variables” if you prefer), which is precisely one of the selection criteria for the datasets considered in this paper.
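A minimal sketch makes this concrete (our own illustration, using scikit-learn; the synthetic dataset and hyperparameters are arbitrary choices, not taken from the paper): on a well-sampled problem, the two models recover nearly the same separating hyperplane.

```python
# Sketch (our illustration): with enough samples relative to the number of
# features, an L2-regularized logistic regression and a linear SVM learn
# nearly proportional weight vectors, i.e. almost the same hyperplane.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

log_reg = LogisticRegression(penalty="l2", C=1.0).fit(X, y)
svm = LinearSVC(C=1.0, max_iter=10000).fit(X, y)

# Cosine similarity between the two weight vectors: close to 1 here.
w_lr, w_svm = log_reg.coef_.ravel(), svm.coef_.ravel()
print(np.dot(w_lr, w_svm) / (np.linalg.norm(w_lr) * np.linalg.norm(w_svm)))
```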

We interpret the authors’ positioning as a defense of a method (logistic regression) traditionally used by statisticians against other methods (denoted here ML methods) historically proposed in the CS/ML community, even though some of the latter were also introduced and/or used by statisticians (such as Random Forests).

This distinction is questionable and, as the authors themselves point out despite their misleading title, ML methods and statistical methods form a continuum. For instance, the LASSO and many related variants have been presented and studied in depth in both the statistical and the machine learning literature. We see no need to draw an artificial boundary where there is none.

To us, the main difference between statistics and machine learning does not reside in a specific variant of a mathematical objective function optimized by a specific algorithm. It is more related to the intended purpose. To stay within the context of linear modeling, a statistician would probably focus first on assessing, via hypothesis testing, whether the coefficients of a model are consistent with the observed distribution of the data at hand, whereas a machine learning expert will be primarily concerned with a robust evaluation of the model’s generalization capabilities (its ability to make predictions on new samples). We believe that each of them should also be interested in the primary concern of the other.
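As a toy contrast of the two emphases (a sketch under our own arbitrary choices of data and libraries, not an example from the paper), the very same logistic model can be examined through coefficient p-values or through cross-validated predictive performance:

```python
# Sketch (our illustration) of the two views on one linear model:
# the statistician inspects coefficient estimates and their p-values,
# the machine learner estimates out-of-sample performance by cross-validation.
import statsmodels.api as sm
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=5, random_state=0)

# Statistical view: inference on the coefficients (Wald tests).
fit = sm.Logit(y, sm.add_constant(X)).fit(disp=0)
print(fit.pvalues.round(4))

# Machine learning view: cross-validated generalization estimate (AUC).
aucs = cross_val_score(LogisticRegression(), X, y, cv=5, scoring="roc_auc")
print(aucs.mean().round(4))
```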

Second, when considering a sufficiently large number of studies, i.e. of datasets, the endeavour of showing the dominance of one particular (set of) models over the others is largely a waste of energy.

Why?

Because of the “no free lunch” theorem (Wolpert and Macready, 1997), which states that any two optimization algorithms (call them learning and/or statistical estimation algorithms, depending on your taste) are equally good when their predictive performance is averaged across all possible problems (technically, over all possible data distributions and/or all possible target functions to be induced from the data). Of course, the paper discussed here does not consider all possible problems, but it already considers quite a few, and the authors would probably have included more studies had it been possible. As far as the question posed by the title of this paper is concerned (or any other version claiming no performance benefit of X over all methods but X), the claim summarized in the original title comes as no surprise: it was formally proven more than 20 years ago.
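The flavor of this result can even be checked numerically on a toy universe (our sketch, with two arbitrary learners; it is not part of the paper): averaged over every possible binary labeling of a small input space, any learner scores exactly 50% on the points it has not seen.

```python
# Toy check of the "no free lunch" flavor (our sketch): over ALL possible
# binary target functions on a finite input space, every learner averages
# 50% accuracy on the points outside its training set.
import itertools
import numpy as np

inputs = list(itertools.product([0, 1], repeat=3))  # the 8 possible inputs
train_idx, test_idx = [0, 1, 2, 3], [4, 5, 6, 7]    # fixed train/test split

def majority(train_x, train_y, x):
    # Learner 1: always predict the majority training label.
    return int(np.mean(train_y) >= 0.5)

def nearest_neighbour(train_x, train_y, x):
    # Learner 2: 1-nearest-neighbour in Hamming distance.
    dists = [sum(a != b for a, b in zip(x, tx)) for tx in train_x]
    return train_y[int(np.argmin(dists))]

for learner in (majority, nearest_neighbour):
    accuracies = []
    # Enumerate ALL 2^8 = 256 possible target functions on the 8 inputs.
    for labels in itertools.product([0, 1], repeat=8):
        train_x = [inputs[i] for i in train_idx]
        train_y = [labels[i] for i in train_idx]
        hits = [learner(train_x, train_y, inputs[i]) == labels[i]
                for i in test_idx]
        accuracies.append(np.mean(hits))
    print(learner.__name__, np.mean(accuracies))  # both average exactly 0.5
```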

3. About readers’ comments: Irrelevance of methodology? The article says the opposite.

In our opinion, despite a disputable choice of title, the true merit of this paper is to highlight the large number of methodological biases encountered in most publications. This could have been the main topic of the paper and been reflected in its title. The rightful message of the authors is that methodology matters, precisely, if one wants to rigorously assess and compare the performance of different models. They list several of these biases, such as selection bias, overfitting, and over-optimistic evaluation of performance. Especially worrying is the fact that analyses exhibiting some type of bias reach conclusions (about the relative performance of ML versus LR) that globally differ from those of the better-conducted analyses.

In that sense, readers stating that the data analysis methodology is irrelevant completely miss a key point of the paper! On the contrary: all attention should go to the methodology of the data analysis roll-out, so that many different model types can be rigorously tested on the specific problem at hand (no longer on an average performance over a large collection of studies), rather than, indeed, to the very type of model used.

Moreover, specifically in the context of biomedical research, one criterion chosen by the authors regarding dataset selection seems very odd: they explicitly excluded datasets with high-dimensional data. Yet biomedical research mainly consists of such high-dimensional data: next-generation sequencing (DNA, RNA, …), spectrometry data, not to mention imaging data. At the same time, ethical, practical and financial constraints generally keep the number of available samples very limited. In short, a vast number of actual studies in healthcare fall into the so-called “small n (number of samples), large p (number of predictors)” setting, with p often one or more orders of magnitude larger than n.

In this setting, model learning is even more prone to overfitting. The learning and (statistically sound) evaluation methodology is even more critical in that case, as are more constrained models, such as regularized models and/or models incorporating relevant prior knowledge (e.g., biological knowledge) about the task at hand.
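To illustrate how strong these biases can be in the small-n, large-p setting (a sketch with arbitrary numbers, not an analysis from the paper), consider pure-noise data: selecting features on the full dataset before cross-validation leaks information and produces spectacular but meaningless AUCs, while performing the selection inside each training fold returns the honest chance-level result.

```python
# Sketch (our illustration) of the selection bias the authors warn about,
# on pure-noise data with n=40 samples and p=2000 features.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 2000))       # small n, large p, no signal at all
y = rng.integers(0, 2, size=40)

# Biased protocol: feature selection sees the future test folds.
X_sel = SelectKBest(f_classif, k=20).fit_transform(X, y)
print(cross_val_score(LogisticRegression(), X_sel, y, cv=5,
                      scoring="roc_auc").mean())   # often well above chance

# Sound protocol: selection is refit inside each training fold only.
pipe = make_pipeline(SelectKBest(f_classif, k=20), LogisticRegression())
print(cross_val_score(pipe, X, y, cv=5,
                      scoring="roc_auc").mean())   # honest, close to 0.5
```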

Technical appendix: How close are a regularized logistic regression and a linear support vector machine?

A regularized logistic regression consists of two parts: a loss function, which is the logistic loss, and a regularizer, which is most commonly an L2-norm (defining a so-called ridge penalty) over the linear model parameters.

A linear support vector machine also consists of two parts: a loss function, which is the hinge loss, and a regularizer, which is also commonly an L2-norm (call it a ridge penalty again if you prefer) over the linear model parameters.
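Written out, with training pairs $(x_i, y_i)$, $y_i \in \{-1, +1\}$, and a regularization weight $\lambda$ (the notation is ours; this is the standard textbook formulation), the two estimation problems read:

$$\min_{w}\; \sum_{i=1}^{n} \log\!\left(1 + e^{-y_i\, w^\top x_i}\right) \;+\; \lambda\,\lVert w \rVert_2^2 \qquad \text{(regularized logistic regression)}$$

$$\min_{w}\; \sum_{i=1}^{n} \max\!\left(0,\; 1 - y_i\, w^\top x_i\right) \;+\; \lambda\,\lVert w \rVert_2^2 \qquad \text{(linear SVM)}$$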

Logistic regression and SVM thus share the same regularizer over the model parameters $w$:

$$\Omega(w) \;=\; \lambda\,\lVert w \rVert_2^2 \;=\; \lambda \sum_{j} w_j^2$$

The two losses differ, but one is merely a differentiable approximation of the other, as can be seen in the figure below.

This figure represents four different loss functions as a function of the margin: the 0/1 loss (orange), the hinge loss (dashed purple), the logistic loss (solid purple) and the square loss (black). The hinge loss is the one used by linear support vector machines (SVM); the logistic loss is the one used by logistic regression.
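For readers without access to the original figure, a few lines of matplotlib reproduce it (our sketch; the scaling is our choice, with the logistic loss rescaled by $1/\log 2$ so that it equals 1 at a zero margin, as is customary):

```python
# Sketch (our code, not the original plot): the four losses described in the
# caption above, as functions of the margin m = y * w^T x.
import numpy as np
import matplotlib.pyplot as plt

m = np.linspace(-2, 2, 400)

plt.plot(m, (m < 0).astype(float), color="orange", label="0/1 loss")
plt.plot(m, np.maximum(0.0, 1.0 - m), "--", color="purple",
         label="hinge loss (SVM)")
plt.plot(m, np.log2(1.0 + np.exp(-m)), color="purple",
         label="logistic loss (logistic regression)")
plt.plot(m, (1.0 - m) ** 2, color="black", label="square loss")
plt.xlabel("margin $y\\,w^T x$")
plt.ylabel("loss")
plt.legend()
plt.show()
```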

In summary: estimating these models corresponds to highly similar optimization problems.