Correlation analysis

Correlation and regression are similar and are sometimes confused with each other. In some cases it is reasonable to perform both analyses; in others only one of them makes sense.

Correlation is used when two variables (X and Y) have been measured for each case and we want to quantify how strongly they are associated. Correlation makes no assumption about whether either variable depends on the other; it does not study the dependence of one variable on another and only describes the association between them. In contrast, regression analysis aims to describe the dependence of a variable on one or more explanatory variables.

An example of association between variables is the strong correlation between the length of the cell and the length of the flagellum attachment zone (Zhou Q et al., 2011).

In turn, an example of regression can be taken from the study of Zhao et al. (2000), who evaluated the dependence of the time-to-detection of growth of Clostridium botulinum on the inoculum size of spores, the concentration of sodium chloride and the initial pH of the medium. This dependence was expressed by a polynomial regression equation:

log₁₀(time-to-detection) = 4.910 - 0.533Linoc + 0.139NaCl + 0.055Linoc² - 0.068pH²,

where Linoc = log(inoculum size),

NaCl = sodium chloride concentration (%),

pH = initial pH of the medium.
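The arithmetic of this equation can be sketched in a few lines of Python. The coefficients are the ones quoted above; the function name and the input values are hypothetical and serve only as an illustration, and the code assumes that "log" in the definition of Linoc means the base-10 logarithm.

```python
import math

def log10_time_to_detection(inoculum_size, nacl_percent, ph):
    """Evaluate the polynomial regression equation quoted above."""
    # Assumption: "log" in the definition of Linoc is the base-10 logarithm.
    linoc = math.log10(inoculum_size)
    return (4.910 - 0.533 * linoc + 0.139 * nacl_percent
            + 0.055 * linoc ** 2 - 0.068 * ph ** 2)

# Hypothetical inputs, chosen only to show the calculation.
log_ttd = log10_time_to_detection(inoculum_size=100, nacl_percent=2.0, ph=6.0)
time_to_detection = 10 ** log_ttd  # back-transform from the log10 scale
print(round(log_ttd, 3), round(time_to_detection, 1))
```

Because the response is modelled on the log10 scale, the prediction has to be back-transformed to obtain the time-to-detection itself.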

Correlation does not change if the variables are swapped: with a positive correlation, higher values of X are accompanied by higher values of Y, and vice versa. Regression, however, treats the variables asymmetrically and describes a strictly one-way dependence: if the activity of an antibiotic increases as the incubation temperature decreases, this does not mean that the temperature is influenced in any way by the activity of the antibiotic.

Correlation answers the question of how strongly two variables are related to each other; it expresses the degree of association between them. It is measured by the coefficient of correlation. As with other statistical methods, there are parametric and nonparametric coefficients of correlation.

SPSS computes bivariate correlations between pairs of variables, partial correlations between two variables while controlling for the effects of one or more additional variables, and a large number of distance measures between either variables or cases, which assess their similarity or dissimilarity and produce a distance matrix. This matrix can be used further in other procedures such as cluster analysis, factor analysis, etc.
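The symmetry of correlation and the asymmetry of regression can be checked on a small example. The sketch below uses made-up temperature and activity values and two hypothetical helper functions; it is not related to any dataset in this chapter.

```python
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def slope(x, y):
    """Least-squares slope of the regression of y on x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sum((a - mx) ** 2 for a in x)

# Hypothetical data: incubation temperature and antibiotic activity.
temperature = [20, 25, 30, 35, 40]
activity = [18, 17, 13, 12, 9]

r_xy = pearson_r(temperature, activity)
r_yx = pearson_r(activity, temperature)        # same value: correlation is symmetric
b_activity_on_temp = slope(temperature, activity)
b_temp_on_activity = slope(activity, temperature)  # a different line altogether
print(r_xy, r_yx, b_activity_on_temp, b_temp_on_activity)
```

Swapping the variables leaves r unchanged but gives a different regression slope, which is why the choice of the dependent variable matters in regression and not in correlation.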

Coefficient of correlation

The coefficient of correlation (r) ranges from -1 to 1. A coefficient equal to 1 means that the values of the two variables change strictly in the same direction, while a coefficient equal to -1 means that they change strictly in opposite directions; a coefficient equal to zero indicates the absence of a linear relationship between the two variables:

Correlation coefficient: different values
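The three reference values of r can be reproduced with a minimal sketch. The data below are invented so that the outcome is easy to follow by hand, and pearson_r is a hypothetical helper written for this illustration.

```python
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

x = [1, 2, 3, 4, 5]
y_same = [2 * v + 1 for v in x]       # rises strictly together with x
y_opposite = [-3 * v + 10 for v in x]  # falls strictly as x rises
y_unrelated = [2, 1, 3, 1, 2]          # no linear trend with x

r_same = pearson_r(x, y_same)
r_opposite = pearson_r(x, y_opposite)
r_unrelated = pearson_r(x, y_unrelated)
print(r_same, r_opposite, r_unrelated)
```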

In general, there are three types of correlations:

1) Positive and negative correlation: when one variable moves in the same direction as the other, the correlation is positive; when it moves in the opposite direction, the correlation is negative.

2) Linear and nonlinear correlation: when both variables change in a constant ratio, the correlation is linear; when there is a relationship between the variables but they do not change in a constant ratio, the correlation is nonlinear.

3) Simple, partial and multiple correlation: if a study includes two variables which are associated with each other, it is a simple correlation. If there are three variables and the correlation between two of them is considered while the third one (the factor variable) is controlled for, it is a partial correlation. When the association of one variable with several other variables is considered together, it is a multiple correlation.
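For three variables, a partial correlation can be obtained from the three simple correlations with the standard first-order formula. The sketch below is only an illustration: the function name and the input coefficients are hypothetical.

```python
import math

def partial_correlation(r_xy, r_xz, r_yz):
    """Correlation between X and Y with the third variable Z controlled for."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Hypothetical simple correlations between three variables.
r_xy_given_z = partial_correlation(r_xy=0.6, r_xz=0.5, r_yz=0.7)
print(round(r_xy_given_z, 3))
```

In this made-up case the simple correlation of 0.6 shrinks to about 0.40 once the shared association with the third variable is removed.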

The value of the correlation coefficient can be interpreted as strong, moderate or weak; however, there are no uniform criteria for these interpretations. One commonly used classification, applied to the absolute value of the coefficient, treats a correlation as strong if the coefficient lies between 0.7 and 1, moderate between 0.3 and 0.7, and weak between 0 and 0.3.

To interpret the results of a correlation analysis it is important to know both the coefficient of correlation and its significance.
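This classification is easy to express as a small function. The sketch below is one possible reading of it: the function name is hypothetical, and assigning the boundary values 0.3 and 0.7 to the higher category is an assumption, since the ranges as stated overlap at their ends.

```python
def correlation_strength(r):
    """Label a correlation coefficient as weak, moderate or strong."""
    if not -1 <= r <= 1:
        raise ValueError("a correlation coefficient must lie between -1 and 1")
    size = abs(r)  # the sign gives the direction, not the strength
    # Assumption: the boundary values 0.3 and 0.7 belong to the higher category.
    if size >= 0.7:
        return "strong"
    if size >= 0.3:
        return "moderate"
    return "weak"

print(correlation_strength(-0.622))
print(correlation_strength(0.8))
print(correlation_strength(0.1))
```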

Coefficient of determination

The squared coefficient of correlation is called the coefficient of determination (r²). It quantifies the proportion of the variance of one variable explained (in the statistical sense) by the other. In other words, the coefficient of determination reflects how well the regression line represents the data. If the regression line on the scatterplot passed exactly through every point, it would explain all of the variation and the coefficient of determination would be equal to 1.0. Conversely, the further the points on the scatterplot lie from the line, the less the line is able to explain.

If the correlation coefficient between two variables is 0.7, then the coefficient of determination is 0.7 × 0.7 = 0.49, that is, 49% of the variability in the values of one variable is explained by the other variable.
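That r² is the share of variance explained by the straight line can be verified numerically. The sketch below fits a least-squares line to made-up data and compares the explained share of variance with the squared Pearson coefficient; both helper functions are hypothetical.

```python
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def explained_variance(x, y):
    """Share of the variance of y explained by the least-squares line."""
    mx, my = mean(x), mean(y)
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    a = my - b * mx
    ss_residual = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ss_total = sum((yi - my) ** 2 for yi in y)
    return 1 - ss_residual / ss_total

# Hypothetical measurements.
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 2.9, 4.2, 4.8, 6.3, 6.9]

r_squared = pearson_r(x, y) ** 2
share_explained = explained_variance(x, y)
print(r_squared, share_explained)  # the two values agree
```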

Scatterplots in analysis of correlation

Before computing correlation coefficients, the data should be checked for the presence of outliers and for the linear character of the association. This can easily be done with a scatterplot of the data. For example, a linear fit does not fully explain the relation between the variables in the Figure:

Nonlinear relations

however, cubic fitting is ideal for expressing this relation:

Nonlinear relations

The easiest way to detect a possible nonlinear relation is visual examination.

Another important application of the scatterplot is to reveal outliers, which may significantly influence the value of the correlation coefficient. For example, if both upper values are considered outliers, the correlation coefficient decreases from 0.47 to 0.39:
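A numerical counterpart of this visual check is sketched below. The data are artificial (an exact cubic relation, not the data behind the figures), and pearson_r is a hypothetical helper. Correlating y with x measures the linear fit; correlating y with x³ measures how well a cubic term describes the same relation.

```python
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# Artificial data following an exact cubic relation.
x = list(range(-3, 4))
y = [v ** 3 for v in x]

r_linear = pearson_r(x, y)                   # about 0.93: the line misses part of the pattern
r_cubic = pearson_r([v ** 3 for v in x], y)  # the cubic term describes y completely
print(r_linear, r_cubic)
```

The same idea underlies transforming a variable to make a relation linear before computing a correlation.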

Correlation coefficient depending on outliers

For parametric data the Pearson coefficient of correlation is most widely used; for nonparametric data, Kendall’s tau-b or Spearman’s coefficient. If parametric and nonparametric correlation coefficients are applied to the same data, the results will differ: the correlation coefficients in Fig. 6.3 were Pearson’s, while the corresponding values of the Spearman coefficient are 0.61, 0.65 and 0.60. As we can see, nonparametric correlation coefficients are less influenced by outliers. They are less sensitive to outliers because they measure rank order rather than the values themselves, and they can more easily reveal the presence of a monotonic association between variables.

However, when a nonlinear association is found, it is often useful to try a transformation of the variables, for example a log-transformation, to make the relation linear, because more predictive models are available for linear relationships and they are generally easier to implement and interpret.
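The different sensitivity of the two kinds of coefficients to outliers can be illustrated with a minimal sketch. The data are invented, the helper functions are hypothetical, and the simple ranking used here assumes that there are no tied values; Spearman's coefficient is computed as the Pearson coefficient of the ranks.

```python
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def ranks(values):
    """Ranks starting at 1 for the smallest value; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def spearman_rho(x, y):
    """Spearman coefficient: Pearson correlation of the ranks."""
    return pearson_r(ranks(x), ranks(y))

x = [1, 2, 3, 4, 5, 6, 7, 8]
y_clean = [2, 3, 4, 5, 6, 7, 8, 9]
y_outlier = [2, 3, 4, 5, 6, 7, 8, 40]  # last value replaced by an extreme one

pearson_clean = pearson_r(x, y_clean)
pearson_outlier = pearson_r(x, y_outlier)    # drops to about 0.70
spearman_clean = spearman_rho(x, y_clean)
spearman_outlier = spearman_rho(x, y_outlier)  # unchanged: the rank order is the same
print(pearson_clean, pearson_outlier, spearman_clean, spearman_outlier)
```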

Performing correlation analysis in SPSS

Returning to our example on the activity of essential oils measured by the disk diffusion method, correlation analysis can help answer the research question of whether there is an association between the activity of thyme oil and that of tea tree oil. With this example we will demonstrate the use of the parametric Pearson’s coefficient of correlation, which requires normally distributed data. In the previous chapters we determined that the results of the disk diffusion study of essential oil activity were normally distributed, and we will now continue working with this dataset (see Example 1).

To specify correlation analysis:

1) Click the Analyze menu, point to Correlate, and select Bivariate… :

The Bivariate Correlations dialog box opens:

2) Select the variables between which the correlation should be examined (“Tea tree” and “Thyme”) and click the transfer arrow button. The selected variables are moved to the Variables: list box.

3) Because our variables are normally distributed, we will use the parametric Pearson correlation coefficient, which is selected by default.

4) Make sure that Two-tailed is selected in the Test of Significance section; it gives the two-sided significance, which is generally the more suitable choice.

5) Flag significant correlations is selected by default; significant correlations will then be marked with asterisks (“*”), the number of which reflects the level of significance. This is very convenient when looking through the results.

6) Click the OK button. An Output Viewer window opens and displays results of correlation analysis:

The Output Viewer window contains the Correlations table, in which each row represents one variable (either “Tea tree” or “Thyme”) and the columns represent the same variables. The cell at the intersection of the two different variables is used for assessing the results. It contains three lines of numbers: the correlation coefficient itself (-0.622), its significance (p value, 0.001) and the number of analyzed cases (24). Therefore, our results mean that there is a significant moderate negative correlation between the activities of tea tree and thyme oils studied by the disk diffusion method.

Let us also build a scatterplot to visualize this correlation.

To build a scatterplot:
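The link between the coefficient, the number of cases and the significance can be followed with the usual t-test for a Pearson coefficient, t = r·√((n − 2)/(1 − r²)) with n − 2 degrees of freedom. The sketch below applies it to the values from the table; the function name is hypothetical, and the critical value 2.074 (two-tailed, 5% level, 22 degrees of freedom) is taken from a standard t table rather than computed.

```python
import math

def t_statistic(r, n):
    """t statistic for testing that a Pearson correlation differs from zero."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

r, n = -0.622, 24  # values read from the Correlations table
t = t_statistic(r, n)

# Two-tailed critical value for 22 degrees of freedom at the 5% level,
# taken from a standard t table (not computed here).
T_CRITICAL_05 = 2.074
is_significant = abs(t) > T_CRITICAL_05
print(round(t, 3), is_significant)
```

The statistic is about -3.73, well beyond the critical value, which agrees with the small p value reported by SPSS.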

1) Click the Graphs menu, and select Chart Builder… :

The Chart Builder dialog box opens:

2) In the Gallery tab select Scatter/Dot; the gallery of scatterplot diagrams appears. In it, select Simple Scatter (in the upper left corner of the gallery) and drag it into the “Chart preview uses example data” window.

3) From the Variables list select the variable “TeaTree” and drag it onto the place for the X-axis. Then select the variable “Thyme” and drag it onto the place for the Y-axis:

4) It may be useful to label the observations on the diagram in order to identify outliers. Click the Groups/Point ID tab and select the Point ID label check box.

5) From the Variables list select the variable “N” and drag it onto the place “Point Label Variable?”:

6) Click the OK button. An Output Viewer window opens and displays scatterplot:

From the resulting scatterplot we see that there are no outliers. The negative correlation is also clearly visible on the scatterplot. It may be interesting to analyse whether there are any visual differences in the scatter of the data belonging to the two species, E. coli and E. faecalis. This can be done by selecting the Grouped Scatter diagram in the Scatter/Dot gallery (second in the upper row of the Gallery). Apart from selecting the variables for the axes, a grouping variable (in our example “Species”) should also be selected and dragged to the place marked “Set color”:

After clicking the OK button we can see scatterplot in the Output Viewer window:

By default different groups are marked with different colours, but the type of marker is the same (circles). This is not always convenient, especially in black-and-white printing; by opening the Chart Editor we can change the type of marker. To do this:

1) Double click the scatterplot; the Chart Editor opens:

2) In the legend (the upper right corner of the Chart Editor window) select the marker which you want to change and double click it. The Properties window opens with the Marker tab active:

3) In the section Marker change Type of marker from circular to, for example, rhomboid:

You may also change the size of the marker, for example, from 5 to 6.

4) In the Color section click the Fill button and select one of the colours, then click the Apply button.

5) All changes can be tracked immediately in the Chart Editor window. Close the Chart Editor to return to the Output Viewer window with the modified scatterplot:

In the grouped scatterplot we can see that the correlation between the studied oils will most probably differ depending on the species of bacteria, because the markers for E. faecalis are mostly located in the upper left part of the chart, while those for E. coli are in the right central part. Therefore, it makes sense to determine the correlation separately for E. coli and for E. faecalis. This is easy to achieve by clicking the Split button in the Data Editor window, selecting Organise Output by Groups and choosing the “Species” variable as the group separator. After repeating all the steps for specifying the correlation analysis, the Output Viewer window will contain the results of the correlation analysis for each species of bacteria. The results now demonstrate that there is a significant negative correlation between the activities of the oils against E. faecalis, equal to -0.603. Against E. coli, however, the correlation is not significant.
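The logic of splitting the data by group and repeating the analysis can be sketched as follows. The records are made-up numbers arranged only to mimic the pattern described above (they are not the Example 1 dataset), and the helper function and variable names are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# Made-up records: (species, tea tree zone, thyme zone).
records = [
    ("E. faecalis", 10, 30), ("E. faecalis", 12, 28),
    ("E. faecalis", 14, 26), ("E. faecalis", 16, 24),
    ("E. coli", 20, 15), ("E. coli", 22, 17),
    ("E. coli", 24, 14), ("E. coli", 26, 16),
]

# Split the cases by species, as the split-by-groups setting does.
groups = defaultdict(lambda: ([], []))
for species, tea_tree, thyme in records:
    groups[species][0].append(tea_tree)
    groups[species][1].append(thyme)

# One correlation coefficient per species.
r_by_species = {name: pearson_r(x, y) for name, (x, y) in groups.items()}
print(r_by_species)
```

With these artificial values the association is perfectly negative in one group and absent in the other, an exaggerated version of the situation in which a pooled coefficient hides differences between groups.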
