Differences between LDA, QDA and Gaussian Naive Bayes classifiers

    A deep dive into the modelling assumptions and their implications

    While digging into the details of classical classification methods, I found only sparse information about the similarities and differences of Gaussian Naive Bayes (GNB), Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA). This post centralises the information I found for the next learner.

    Summary: all three methods are specific instances of the Bayes classifier; they all deal with continuous Gaussian predictors; they differ in the assumptions they make about the relationships among predictors and across classes (i.e. how they specify the covariance matrices).

    The Bayes Classifier

    We have a set X of p predictors, and a discrete response variable Y (the class) taking values k = {1, …, K}, for a sample of n observations.

    We encounter a new observation for which we know the values of the predictors X, but not the class Y, so we would like to make a guess about Y based on the information we have (our sample).

    The Bayes classifier assigns the test observation to the class with the highest conditional probability, given by:

    Pr(Y = k | X = x) = pi_k * f_k(x) / sum_l [ pi_l * f_l(x) ]   (Bayes' theorem)

    where pi_k is the prior probability of class k, and f_k(x) is the likelihood of x within class k. To obtain the probability of class k, we need to define formulae for the prior and the likelihood.

    The prior. The probability of observing class k, i.e. that our test observation belongs to class k, without having further information on the predictors. Looking at our sample, we can think of the cases in class k as realisations from a random variable with Binomial distribution:

    n_k ~ Binomial(n, pi_k): the distribution of the number of successes (observations in class k) in a set number of trials n (the size of our sample),

    where, in each of the n trials, the observation either belongs (success) or does not belong (failure) to class k. It can be shown that the relative frequency of successes — the number of successes over the total number of trials — is an unbiased estimator for pi_k. Hence, we use the relative frequency as our prior for the probability that an observation belongs to class k.
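    As a quick sketch (the label vector below is made up for illustration), the prior estimates are just the class relative frequencies:

    ```python
    import numpy as np

    # Hypothetical sample of n = 10 class labels for K = 2 classes
    y = np.array([1, 1, 2, 1, 2, 1, 1, 2, 2, 1])

    # Unbiased estimate of pi_k: relative frequency of each class in the sample
    classes, counts = np.unique(y, return_counts=True)
    priors = counts / len(y)
    ```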

    The likelihood. The probability of seeing these values of X, given that the observation actually belongs to class k. Hence, we need to find the distribution of the predictors X in class k. We don’t know what the “true” distribution is, so we can’t “find” it; rather, we make some reasonable assumptions about what it might look like, and then use our sample to estimate its parameters.

    How do we choose a reasonable distribution? One clear division arises between discrete and continuous predictors. All three methods assume that, within each class,

    Predictors have a Gaussian distribution (p=1) or Multivariate Gaussian (p>1).

    f_k(x) = (2*pi)^(-p/2) * |Sigma_k|^(-1/2) * exp( -(1/2) * (x - mu_k)' Sigma_k^(-1) (x - mu_k) )   (general form of the Multivariate Gaussian distribution conditional on class k)

    Hence, these algorithms can be used only when we have continuous predictors. In fact, Gaussian Naive Bayes is a specific case of general Naive Bayes with a Gaussian likelihood, which is why I’m comparing it with LDA and QDA in this post.
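    As a sketch of the likelihood step, the class-conditional density f_k(x) can be evaluated with scipy; the mean vector, covariance matrix and observation below are all hypothetical:

    ```python
    import numpy as np
    from scipy.stats import multivariate_normal

    # Hypothetical class-k parameters for p = 2 predictors
    mu_k = np.array([0.0, 1.0])          # mean vector of class k
    sigma_k = np.array([[1.0, 0.5],
                        [0.5, 2.0]])     # covariance matrix of class k

    x = np.array([0.5, 1.5])             # new observation

    # f_k(x): multivariate Gaussian density of class k evaluated at x
    likelihood = multivariate_normal(mean=mu_k, cov=sigma_k).pdf(x)
    ```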

    From now on, we’ll consider the simplest case that can showcase the differences between the three methods: two predictors (p=2) and two classes (K=2).

    Linear Discriminant Analysis

    LDA assumes that the covariance matrix across classes is the same.

    That is, predictors in class 1 and class 2 might have different means, but their variances and covariances are the same: the “spread” of, and relationship between, predictors is identical across classes.

    Visualisation of classes distribution based on LDA assumptions

    The plot above was generated from distributions for each class of the form:

    Distributions of predictors in classes 1 and 2 for LDA

    where we observe that the covariance matrices are the same. This assumption is reasonable if we expect the relationship between predictors not to change across classes, with only a shift in the means of the distributions.
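    A minimal sketch of this setup with scikit-learn: the two classes below share a covariance matrix and differ only in their means (all numbers are made up for illustration):

    ```python
    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    rng = np.random.default_rng(0)

    # Hypothetical shared covariance matrix: same spread and same
    # predictor relationship in both classes, as LDA assumes
    shared_cov = [[2.0, 1.0], [1.0, 2.0]]

    # The classes differ only by a shift in their means
    X1 = rng.multivariate_normal([0.0, 0.0], shared_cov, size=200)
    X2 = rng.multivariate_normal([4.0, 4.0], shared_cov, size=200)
    X = np.vstack([X1, X2])
    y = np.array([1] * 200 + [2] * 200)

    lda = LinearDiscriminantAnalysis().fit(X, y)
    ```

    Because the data match the model's assumptions, the linear boundary separates the classes well here.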

    Quadratic Discriminant Analysis

    If we relax the constant covariance matrix assumption of LDA, we have QDA.

    QDA does not assume constant covariance matrix across classes.

    Visualisation of classes distribution based on QDA assumptions

    The plot above was generated from distributions for each class of the form:

    Distributions of predictors in classes 1 and 2 for QDA

    where we observe that the two distributions are allowed to differ in all their parameters. This is a reasonable assumption if we expect the behaviour and relationships amongst predictors to be very different in different classes.

    In this example, even the direction of the relationship between the two predictors varies from class 1 to class 2, from a positive covariance of 4, to a negative covariance of -3.
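    A sketch of the same idea with scikit-learn: the per-class covariances below echo the example's covariances of 4 and -3, while the variances and means are made up:

    ```python
    import numpy as np
    from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

    rng = np.random.default_rng(0)

    # Class 1: positive covariance (4); class 2: negative covariance (-3).
    # The variances (diagonals) are hypothetical, chosen to keep the
    # matrices positive definite.
    cov1 = [[5.0, 4.0], [4.0, 5.0]]
    cov2 = [[4.0, -3.0], [-3.0, 4.0]]

    X1 = rng.multivariate_normal([0.0, 0.0], cov1, size=200)
    X2 = rng.multivariate_normal([2.0, 2.0], cov2, size=200)
    X = np.vstack([X1, X2])
    y = np.array([1] * 200 + [2] * 200)

    qda = QuadraticDiscriminantAnalysis().fit(X, y)
    ```

    With a separate covariance matrix per class, the decision boundary is quadratic rather than linear.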

    Gaussian Naive Bayes

    GNB is a specific case of Naive Bayes where the predictors are continuous and normally distributed within each class k. The general Naive Bayes (and hence GNB too) assumes:

    Given Y, the predictors X are conditionally independent.

    Independence implies zero correlation, i.e. a covariance of zero between every pair of predictors, so within each class the covariance matrix is diagonal.

    Visualisation of classes distribution based on GNB assumptions

    The plot above was generated from distributions for each class of the form:

    Distributions of predictors in classes 1 and 2 for GNB

    With Naive Bayes we assume there is no relationship between the predictors. In real problems this is rarely the case; nevertheless, it considerably simplifies the problem, as we will see in the next section.
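    A minimal sketch with scikit-learn's GaussianNB, on synthetic data that satisfies the conditional-independence assumption (each predictor is drawn independently within each class; all numbers are made up):

    ```python
    import numpy as np
    from sklearn.naive_bayes import GaussianNB

    rng = np.random.default_rng(0)

    # Within each class, the two predictors are drawn independently
    # (diagonal covariance), matching the Naive Bayes assumption
    X1 = np.column_stack([rng.normal(0.0, 1.0, 200),
                          rng.normal(0.0, 2.0, 200)])
    X2 = np.column_stack([rng.normal(3.0, 1.0, 200),
                          rng.normal(3.0, 2.0, 200)])
    X = np.vstack([X1, X2])
    y = np.array([1] * 200 + [2] * 200)

    gnb = GaussianNB().fit(X, y)
    ```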

    Implications of Assumptions

    After selecting a model, we estimate the parameters of the within-class distributions to determine the likelihood of our test observation, and obtain the final conditional probability we use to classify it.

    The different models result in different numbers of parameters being estimated. Reminder: we have p predictors and K classes. For all models we need to estimate the means of the Gaussian distribution of the predictors, which can differ in each class. This gives a base of p*K parameters to be estimated for all methods.

    Additionally, if we pick LDA we estimate the variances for all p predictors and covariances for each pair of predictors, resulting in

    p + p*(p-1)/2 = p*(p+1)/2   (p variances plus p*(p-1)/2 pairwise covariances: number of parameters to be estimated with LDA)

    parameters. These are constant across classes.

    For QDA, since they differ in each class, we multiply the number of parameters for LDA times K, resulting in the following equation for the estimated number of parameters:

    K * p*(p+1)/2   (number of parameters to be estimated with QDA)

    For GNB, we only have the variances for all predictors in each class: p*K.

    It is easy to see the advantage of using GNB for large values of p and/or K. For the frequently occurring problem of binary classification, i.e. K=2, this is how the model complexity evolves for increasing p across the three algorithms.
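    The counts above can be sketched as a small helper, following the formulae derived in this section (the total for each method is the shared p*K means plus its covariance parameters):

    ```python
    def n_parameters(p: int, K: int) -> dict:
        """Parameters estimated by each method for p predictors and K classes."""
        means = p * K                        # means: shared by all three methods
        cov_lda = p * (p + 1) // 2           # one covariance matrix for all classes
        cov_qda = K * p * (p + 1) // 2       # one covariance matrix per class
        cov_gnb = p * K                      # only variances, per class
        return {
            "LDA": means + cov_lda,
            "QDA": means + cov_qda,
            "GNB": means + cov_gnb,
        }
    ```

    For instance, with p=2 and K=2, LDA estimates 7 parameters, QDA 10 and GNB 8; as p grows, GNB's linear growth in p pulls far ahead of the quadratic growth of LDA and QDA.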

    So What?

    From a modelling perspective, knowing which assumptions you’re making is important when applying a method. The more parameters we need to estimate, the more sensitive the final classification is to changes in our sample. At the same time, if the number of parameters is too low, we’ll fail to capture important differences across classes.

    Thank you for your time, I hope it was interesting.

    All images unless otherwise noted are by the author.
