Discriminant analysis

The purpose of discriminant analysis is to predict the value of a categorical dependent variable from the values of predictors. To some extent it is similar to multiple linear regression, but the outcome variable is categorical, with two or more categories, rather than continuous as in linear regression. Discriminant analysis rests on several assumptions: the predictor variables must be normally distributed; the categories of the outcome variable must be well defined and mutually exclusive; the group (category) sizes of the dependent variable should be similar, and each should be at least five times the number of predictors.

Discriminant analysis produces several linear equations (discriminant functions), the number of which equals either the number of categories of the dependent variable or the number of categories minus one:

$$D = v_1 X_1 + v_2 X_2 + \dots + v_i X_i + k,$$

where D is the discriminant function, v is the discriminant coefficient or weight of a predictor variable, X is a predictor variable, k is a constant, and i is the number of predictors.
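As a minimal sketch of the linear form described above, the following Python snippet evaluates a discriminant function for a single case. The function name `discriminant_score`, the weights, and the constant are hypothetical and serve only as an illustration; they are not taken from any fitted model.

```python
def discriminant_score(weights, values, constant):
    """Return D = v1*X1 + v2*X2 + ... + vi*Xi + k for one case."""
    if len(weights) != len(values):
        raise ValueError("one weight is needed per predictor")
    return sum(v * x for v, x in zip(weights, values)) + constant


# Hypothetical coefficients for two predictors (illustration only).
example_weights = [0.5, -2.0]
example_constant = 1.0
example_case = [4.0, 0.25]

example_score = discriminant_score(example_weights, example_case, example_constant)
print(example_score)  # 0.5*4.0 + (-2.0)*0.25 + 1.0 = 2.5
```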

A discriminant function is constructed to maximize the distance between the categories. In general, there are two approaches to discriminant analysis. SPSS produces a number of discriminant functions equal to the number of categories of the dependent variable minus one, and the membership of a new case is determined by which category mean (centroid) of the function its score lies closest to. Some other statistical packages (for example, STATISTICA) produce as many functions as there are categories, and a new case is assigned to the category whose function gives the highest value.
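The two assignment rules described above can be sketched as follows. This is a simplified illustration, not the implementation used by either package: the nearest-centroid rule here treats all groups as equally likely, and the group labels, centroids, and function coefficients are hypothetical.

```python
def nearest_centroid(score, centroids):
    """First rule: pick the group whose centroid is closest to the score."""
    return min(centroids, key=lambda group: abs(score - centroids[group]))


def highest_function(case, functions):
    """Second rule: evaluate one linear function per group and pick the
    group whose function gives the largest value."""
    def value(group):
        weights, constant = functions[group]
        return sum(w * x for w, x in zip(weights, case)) + constant
    return max(functions, key=value)


# Hypothetical centroids of a single discriminant function.
centroids = {"A": -1.0, "B": 0.5}
# Hypothetical classification functions: (weights, constant) per group.
functions = {"A": ([0.2, 1.0], -3.0), "B": ([0.4, 0.5], -4.0)}

by_centroid = nearest_centroid(-0.4, centroids)        # distances 0.6 vs 0.9
by_function = highest_function([10.0, 1.0], functions)  # values 0.0 vs 0.5
print(by_centroid, by_function)
```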

Selected examples of the application of discriminant analysis in microbiological studies

| Dependent variable | Predictor variables | Reference |
|---|---|---|
| Modes of action of antifungal compounds | Metabolic footprints of Saccharomyces cerevisiae | Allen et al., 2004 |
| Sources of fecal contamination | Disk diffusion inhibition zones for different antimicrobials | Kaneene et al., 2007 |
| In vitro antimalarial activity of new compounds against liver stages of Plasmodium | Topological indices of chemical compounds | Mahmoudi et al., 2008 |
| Presence of systemic candidiasis | IgG responses against the antigens of Candida albicans | Clancy et al., 2008 |
| Models of subacute ruminal acidosis induction in dairy cattle | Duration of rumen pH below 5.6, free rumen lipopolysaccharide, rumen microbial community dynamics, and serum haptoglobin | Khafipour et al., 2010 |
| Spore samples of Bacillus cereus grown on different media | Cellular fatty acid methyl ester profiles | Ehrhardt et al., 2010 |
| Intestinal bacterial community composition, prevalence of butyrate production pathway genes, and occurrence of Escherichia coli virulence factors in ileum-cannulated growing pigs | Nonstarch polysaccharide fractions | Metzler-Zebeli et al., 2010 |
| Phage types of Salmonella enteritidis | Fourier transform infrared spectra | Preisner et al., 2010 |
| Mycobacterium tuberculosis culture-positive sputum | Different electronic nose technology | Kolk et al., 2010 |
| Animal source of Escherichia coli | Virulence factors | David et al., 2010 |

Performing discriminant analysis in SPSS

Let us discuss the following example.

We want to develop a model for predicting, before operation, the presence of infection in peripancreatic fat tissue in patients with pancreonecrosis, based on the clinical and laboratory examination of a patient. The presence of microorganisms in peripancreatic fat (infectious pancreatitis) is difficult to diagnose before operation, because it is difficult to take a sample for bacteriological examination before the site of infection is surgically opened. However, early knowledge about infection may be useful in choosing the antibiotic treatment scheme. To build the prognostic model, we retrospectively analyzed clinical and laboratory examination data for 68 patients with pancreonecrosis on admission to hospital and compared the examination results with the post-operative diagnosis. The patients were thus divided into two groups: sterile pancreonecrosis (21 patients) and infectious pancreonecrosis (47 patients). For the model we selected only those variables whose values differed significantly between the two groups. Among the clinical and laboratory parameters, only two differed significantly between sterile and infected patients: the ratio of eosinophils to lymphocytes in peripheral blood (ELR) and the assessment of the severity of the patient’s state by the Simplified Acute Physiology Score III (SAPS III) (Metnitz et al., 2005). This produced the following dataset (see the table below). Patients with sterile forms were designated as group “0” and those with infected forms as group “1”.

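Variables with significant differences between patients with sterile and infectious pancreonecrosis on admission to hospital

| N | Group of patients | ELR | SAPS III |
|---|---|---|---|
| 1 | 0 | 0.000 | 51 |
| 2 | 0 | 0.167 | 51 |
| 3 | 1 | 0.083 | 47 |
| 4 | 0 | 0.100 | 36 |
| 5 | 0 | 0.091 | 44 |
| 6 | 1 | 0.077 | 55 |
| 7 | 1 | 0.077 | 49 |
| 8 | 1 | 0.133 | 58 |
| 9 | 1 | 0.400 | 54 |
| 10 | 1 | 0.000 | 47 |
| 11 | 0 | 0.091 | 43 |
| 12 | 1 | 0.000 | 49 |
| 13 | 1 | 0.162 | 47 |
| 14 | 1 | 0.067 | 62 |
| 15 | 1 | 0.000 | 47 |
| 16 | 1 | 0.400 | 47 |
| 17 | 1 | 0.000 | 57 |
| 18 | 1 | 0.059 | 51 |
| 19 | 0 | 0.167 | 33 |
| 20 | 1 | 0.167 | 45 |
| 21 | 1 | 0.000 | 41 |
| 22 | 1 | 0.000 | 47 |
| 23 | 1 | 0.154 | 47 |
| 24 | 1 | 0.000 | 47 |
| 25 | 1 | 0.111 | 60 |
| 26 | 0 | 0.000 | 53 |
| 27 | 1 | 0.063 | 51 |
| 28 | 1 | 0.000 | 53 |
| 29 | 0 | 0.074 | 42 |
| 30 | 1 | 0.000 | 51 |
| 31 | 1 | 0.000 | 50 |
| 32 | 0 | 0.121 | 42 |
| 33 | 1 | 0.017 | 60 |
| 34 | 1 | 0.000 | 51 |
| 35 | 0 | 0.400 | 47 |
| 36 | 1 | 0.071 | 42 |
| 37 | 1 | 0.000 | 55 |
| 38 | 0 | 0.042 | 42 |
| 39 | 0 | 0.000 | 50 |
| 40 | 1 | 0.000 | 57 |
| 41 | 0 | 0.029 | 42 |
| 42 | 0 | 0.273 | 52 |
| 43 | 1 | 0.000 | 60 |
| 44 | 1 | 0.000 | 51 |
| 45 | 0 | 0.250 | 50 |
| 46 | 1 | 0.200 | 42 |
| 47 | 0 | 0.091 | 50 |
| 48 | 1 | 0.029 | 54 |
| 49 | 1 | 0.024 | 42 |
| 50 | 1 | 0.000 | 53 |
| 51 | 0 | 0.100 | 49 |
| 52 | 1 | 0.125 | 42 |
| 53 | 1 | 0.250 | 55 |
| 54 | 1 | 0.000 | 42 |
| 55 | 1 | 0.056 | 47 |
| 56 | 0 | 0.000 | 53 |
| 57 | 1 | 0.000 | 42 |
| 58 | 0 | 0.333 | 60 |
| 59 | 1 | 0.000 | 63 |
| 60 | 1 | 0.000 | 63 |
| 61 | 1 | 0.125 | 55 |
| 62 | 0 | 0.063 | 42 |
| 63 | 0 | 0.111 | 33 |
| 64 | 1 | 0.000 | 60 |
| 65 | 1 | 0.000 | 55 |
| 66 | 1 | 0.000 | 49 |
| 67 | 1 | 0.028 | 56 |
| 68 | 1 | 0.294 | 56 |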

To start discriminant analysis in SPSS:

1) Click the Analyze menu, point to Classify, and select Discriminant… :

Discriminant analysis in SPSS

The Discriminant Analysis dialog box opens:

Selecting variables

2) Select the grouping variable (“Infection”) and click the upper transfer arrow button. The selected variable is moved to the Grouping Variable: list box.

3) Click the Define Range button. The Discriminant Analysis: Define dialog box opens:

Groups of dependent variable

4) Enter the lowest and highest group codes (0 and 1 in our example) in the corresponding boxes. Click the Continue button to return to the Discriminant Analysis dialog box.

5) Select the predictor variables (“ELR” and “SAPSIII”) and click the next transfer arrow button. The selected variables are moved to the Independents: list box.

6) By default, the Enter independents together method is selected; if necessary, Use stepwise method may be selected instead. Let us keep the default setting.

7) Click the Statistics… button. The Discriminant Analysis: Statistics dialog box opens:

Selecting displayed statistics

8) Select the Means, Univariate ANOVAs, Box’s M, Unstandardized, and Within-Groups Correlation check boxes. Click the Continue button to return to the Discriminant Analysis dialog box.

9) Click the Classify… button. The Discriminant Analysis: Classification dialog box opens:

Specifying other parameters

10) In the Prior Probabilities section, select Compute from group sizes; in the Display section, select Summary table and Leave-one-out classification; in the Use Covariance Matrix section, keep the default Within-Groups option; in the Plots section, select all plots. Click the Continue button to return to the Discriminant Analysis dialog box.

11) Click the Save button. The Discriminant Analysis: Save dialog box opens:

Parameters to be saved

12) Select all check boxes and click the Continue button.

13) Click the OK button in the main dialog box. An Output Viewer window opens with results of discriminant analysis.

The results of discriminant analysis are displayed in a number of tables and figures. As with every analysis, the Analysis Case Processing Summary table contains information about the cases included in the analysis:

Discriminant analysis results

The next table, Group Statistics, shows the means and standard deviations of the predictor variables in each group of the outcome variable. It gives a rough preliminary picture of the differences in predictor values between groups, without indicating their statistical significance:

Discriminant analysis results

The Tests of Equality of Group Means table provides further information about the significance of differences in predictor values between the groups. If there are no differences, it is not worthwhile to continue with discriminant analysis. In our example, a significant difference is present only for the variable “SAPS III”. However, the significance of ELR is also close to 0.05, so it may make sense to keep this variable in the model and to compare the results with and without it:

Discriminant analysis 

The Pooled Within-Groups Matrices table evaluates intercorrelations between the variables. In our example the intercorrelation is very low (0.022), which supports including both variables in the analysis:

Discriminant analysis results 

In contrast to ANOVA, where the basic assumption is equality of variances across groups, the basic assumption of discriminant analysis is equality of the variance-covariance matrices. The null hypothesis states that the covariance matrices are equal across the groups formed by the dependent variable. A non-significant result, meaning that the null hypothesis is not rejected, is the desired outcome. Box’s M test is used to test this assumption.

The section Box’s Test of Equality of Covariance Matrices contains necessary data to assess equality of covariance matrices:

Box’s test of equality of covariance matrices

The significance level of 0.692 is the desired result: the null hypothesis of equal covariance matrices is not rejected. However, with a large dataset even a significant M (p < 0.05) is not regarded as a serious drawback. If several groups are compared and M is significant, groups with very small absolute values of log determinants should be removed and the discriminant analysis repeated without them.

The next section of results is Summary of Canonical Discriminant Functions. It contains tables with eigenvalues and with Wilks’ Lambda:

Summary of canonical discriminant functions

The Eigenvalues table provides information about the discriminant functions produced. In SPSS the number of discriminant functions equals the number of groups minus one. We have only two groups (sterile and infective pancreatitis), so only one function (equation) is generated. The canonical correlation is the multiple correlation between the predictors and the discriminant function. When there is only one function, the canonical correlation is an index of overall model fit, and its square can be interpreted like R² in linear regression, as the proportion of variance explained. Our model explains very little variance: the canonical correlation is 0.418, so only 17.5% of the total variance (0.418² ≈ 0.175) is explained by the model, and in future we should consider how to improve it.

Wilks’ Lambda is, to some extent, the complement of the squared canonical correlation: it gives the proportion of variance that is not explained by the model (82.5%). The table also reports the significance of the model (p = 0.002); therefore, although the model explains little of the total variance, it is statistically significant.
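The arithmetic linking these quantities for a single discriminant function can be checked with a few lines of Python. Only the canonical correlation of 0.418 comes from the text; the eigenvalue is derived here from the standard relation eigenvalue = r² / (1 − r²) and is not quoted from the SPSS output.

```python
canonical_correlation = 0.418  # value reported in the Eigenvalues table

# Proportion of variance explained by the single discriminant function.
explained = canonical_correlation ** 2

# With one function, Wilks' Lambda is the unexplained proportion.
wilks_lambda = 1 - explained

# Derived quantity (an assumption of this sketch, not read from the output).
eigenvalue = explained / (1 - explained)

print(f"explained = {explained:.3f}")        # 0.175
print(f"Wilks' Lambda = {wilks_lambda:.3f}")  # 0.825
print(f"eigenvalue = {eigenvalue:.3f}")
```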

The next set of tables contains values of the discriminant model itself:

Discriminant model tables

The Standardized Canonical Discriminant Function Coefficients table displays coefficients that indicate the relative importance of each variable in the model. These coefficients have a meaning similar to the beta coefficients of multiple regression. SAPS III is a stronger predictor than ELR because the absolute value of its standardized coefficient is higher. The sign indicates the direction of the relationship: an increase in SAPS III raises the probability of falling into the infected group, whereas for ELR the relationship is the opposite, and high values are more favourable for a patient.

The Structure Matrix table shows the correlation of each variable with the discriminant function and is one more way to assess the importance of variables in the model. Many researchers consider it more accurate than the previous table and prefer it for this reason. The results in the structure matrix are similar: SAPS III appears to be a better predictor than ELR.

The Canonical Discriminant Function Coefficients table contains the unstandardized coefficients of the discriminant model, similar to the B coefficients in multiple regression. Therefore, our discriminant function (D) can be written as:

.
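For readers without SPSS, unstandardized coefficients of this kind can be estimated directly from the dataset. The sketch below is not the SPSS algorithm itself: it is a plain two-group Fisher discriminant, scaled so that the pooled within-group variance of D is 1 and the grand mean of D is 0, which is the usual convention for canonical discriminant scores. The function name `fit_two_group_lda` is hypothetical, and the data are transcribed from the table of 68 patients above.

```python
from math import sqrt

# Group (0 = sterile, 1 = infected), ELR and SAPS III for patients 1..68.
GROUP = [int(c) for c in (
    "0010011111" "0111111101" "1111101101" "1011011001"
    "0011010111" "0111101011" "10011111")]
ELR = [
    0.000, 0.167, 0.083, 0.100, 0.091, 0.077, 0.077, 0.133, 0.400, 0.000,
    0.091, 0.000, 0.162, 0.067, 0.000, 0.400, 0.000, 0.059, 0.167, 0.167,
    0.000, 0.000, 0.154, 0.000, 0.111, 0.000, 0.063, 0.000, 0.074, 0.000,
    0.000, 0.121, 0.017, 0.000, 0.400, 0.071, 0.000, 0.042, 0.000, 0.000,
    0.029, 0.273, 0.000, 0.000, 0.250, 0.200, 0.091, 0.029, 0.024, 0.000,
    0.100, 0.125, 0.250, 0.000, 0.056, 0.000, 0.000, 0.333, 0.000, 0.000,
    0.125, 0.063, 0.111, 0.000, 0.000, 0.000, 0.028, 0.294]
SAPS = [
    51, 51, 47, 36, 44, 55, 49, 58, 54, 47,
    43, 49, 47, 62, 47, 47, 57, 51, 33, 45,
    41, 47, 47, 47, 60, 53, 51, 53, 42, 51,
    50, 42, 60, 51, 47, 42, 55, 42, 50, 57,
    42, 52, 60, 51, 50, 42, 50, 54, 42, 53,
    49, 42, 55, 42, 47, 53, 42, 60, 63, 63,
    55, 42, 33, 60, 55, 49, 56, 56]


def mean(xs):
    return sum(xs) / len(xs)


def fit_two_group_lda(group, x1, x2):
    """Return (v1, v2, k) for D = v1*x1 + v2*x2 + k (two groups coded 0/1)."""
    means = {}
    s11 = s12 = s22 = 0.0
    for g in (0, 1):
        a = [u for u, gg in zip(x1, group) if gg == g]
        b = [v for v, gg in zip(x2, group) if gg == g]
        ma, mb = mean(a), mean(b)
        means[g] = (ma, mb)
        # Accumulate within-group sums of squares and cross-products.
        s11 += sum((u - ma) ** 2 for u in a)
        s22 += sum((v - mb) ** 2 for v in b)
        s12 += sum((u - ma) * (v - mb) for u, v in zip(a, b))
    dof = len(group) - 2  # pooled degrees of freedom for two groups
    s11, s12, s22 = s11 / dof, s12 / dof, s22 / dof
    d1 = means[1][0] - means[0][0]
    d2 = means[1][1] - means[0][1]
    det = s11 * s22 - s12 ** 2
    # w = S^-1 (m1 - m0), using the explicit inverse of the 2x2 pooled matrix.
    w1 = (s22 * d1 - s12 * d2) / det
    w2 = (-s12 * d1 + s11 * d2) / det
    # Rescale so that the pooled within-group variance of D equals 1.
    scale = sqrt(w1 * w1 * s11 + 2 * w1 * w2 * s12 + w2 * w2 * s22)
    v1, v2 = w1 / scale, w2 / scale
    k = -(v1 * mean(x1) + v2 * mean(x2))  # centres D on the grand mean
    return v1, v2, k


v_elr, v_saps, const = fit_two_group_lda(GROUP, ELR, SAPS)
scores = [v_elr * e + v_saps * s + const for e, s in zip(ELR, SAPS)]
centroids = {g: mean([d for d, gg in zip(scores, GROUP) if gg == g])
             for g in (0, 1)}
print("coefficients:", v_elr, v_saps, "constant:", const)
print("centroids:", centroids)
```

The printed coefficients and centroids can be compared with the Canonical Discriminant Function Coefficients and Function at Group Centroids tables of the SPSS output; small differences may arise from rounding in the published data.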

The next table is called Function at Group Centroids. Centroids are the mean values of the discriminant function for each group. This table supplements the equation above: it shows how to relate the value of the discriminant function to the prognosis for an individual case. Cases with scores near a centroid are predicted to belong to that group; for example, a patient with a discriminant score of -0.5 lies closer to the sterile centroid (-0.678) than to the infected one (0.303) and, if the groups are treated as equally likely, is predicted to be sterile. In that case the cut-off is the midpoint between the centroids, (-0.678 + 0.303) / 2 ≈ -0.188. When prior probabilities are computed from group sizes, as in this example, the boundary shifts towards the centroid of the smaller group.
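As a sketch of how a cut-off on the discriminant score can be derived from the two centroids, the snippet below computes the midpoint (equal priors) and a prior-adjusted boundary. The prior-adjusted formula assumes that the scores are normally distributed with unit within-group variance in both groups, which is the usual scaling of canonical discriminant scores; the exact boundary used by SPSS may differ slightly, so the second value should be read as an approximation.

```python
from math import log

centroid_sterile, centroid_infected = -0.678, 0.303  # from the centroids table
n_sterile, n_infected = 21, 47                       # group sizes

# Equal priors: the boundary is halfway between the two centroids.
midpoint = (centroid_sterile + centroid_infected) / 2

# Priors computed from group sizes.
prior_sterile = n_sterile / (n_sterile + n_infected)
prior_infected = n_infected / (n_sterile + n_infected)

# Assumption: unit within-group variance of the scores in both groups.
# A case is assigned to "infected" when its score exceeds this boundary.
shift = log(prior_infected / prior_sterile) / (centroid_infected - centroid_sterile)
prior_adjusted = midpoint - shift

print(f"midpoint cut-off: {midpoint:.3f}")
print(f"prior-adjusted cut-off: {prior_adjusted:.3f}")
```

Under these assumptions the prior-adjusted boundary lies well below the midpoint, which is consistent with most sterile cases being classified as infected in the classification results.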

The Classification Statistics set of tables describes the results of case classification:

Classification statistics

The Classification Processing Summary table summarizes the processed and excluded cases. The Prior Probability for Groups table gives the probability of a case falling into each group by chance, according to the ratio of group sizes. It indicates the accuracy of a prognosis made by chance, without any model. For example, if we assign a randomly chosen patient to the infected group, the assignment will be correct in 69.1% of cases (47 of 68).

The results of discriminant analysis also include two histograms, titled Canonical Discriminant Function, showing the distribution of discriminant scores for each group. These charts help to assess how well the groups are discriminated. When the distributions overlap, the discriminant function is not very effective, as in our example:

Distribution of discriminant scores in groups

Finally, the Classification Results table shows whether the classification of cases by the discriminant model is successful. Among infected patients 91.5% were classified correctly, whereas among sterile patients only 23.8% were:

Classification table
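The percentages in the classification table can be put side by side with the chance level from the prior probabilities. In the sketch below the counts of correctly classified cases (43 infected, 5 sterile) are assumptions inferred from the reported percentages and group sizes; they are not read from the SPSS output.

```python
n_sterile, n_infected = 21, 47

# Assumed counts, inferred from the reported 91.5 % and 23.8 %.
correct_infected = 43
correct_sterile = 5

rate_infected = correct_infected / n_infected
rate_sterile = correct_sterile / n_sterile
overall = (correct_infected + correct_sterile) / (n_sterile + n_infected)

# Accuracy obtained by assigning every patient to the larger (infected) group.
chance = n_infected / (n_sterile + n_infected)

print(f"infected correct: {rate_infected:.1%}")  # 91.5%
print(f"sterile correct:  {rate_sterile:.1%}")   # 23.8%
print(f"overall correct:  {overall:.1%}")        # 70.6%
print(f"chance level:     {chance:.1%}")         # 69.1%
```

Under these assumed counts the overall accuracy is only slightly above the chance level, which illustrates why the model needs improvement despite the high rate for infected patients.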

When specifying the parameters for discriminant analysis, we chose to save additional variables, which have now appeared in the Data Editor. The “Dis_1” variable contains the predicted group membership of each case, the “Dis1_1” variable holds the discriminant scores, and the “Dis1_2” and “Dis2_2” variables contain the probabilities of belonging to the first (sterile) and second (infected) group, respectively:

Saved variables

The labels of the variables in the Variable View tab of the Data Editor indicate which variable contains which data.

Our discriminant model appears to perform well in predicting the infected form; however, the results for sterile forms were unsatisfactory, so it is reasonable to try to improve the prognosis. We may add further variables or apply another prediction method, for example logistic regression.