Correlation analysis

The window 'Correlation analysis' can be opened from the menu 'Analysis' by clicking 'Correlation analysis', or by the corresponding shortcut in the toolbar.


This window is designed to study pairwise correlations among molecular descriptors and to analyse graphically the distribution of molecules in the space defined by selected molecular descriptors. Moreover, if quantitative user-defined variables such as experimental properties are available, a graphical correlation analysis can be performed, together with a selection of the molecular descriptors that correlate most strongly with one of the external variables. This helps the user to identify which descriptors depend on a given molecular property and could therefore be useful in modelling it.


The window 'Correlation analysis' has two tabbed sections:

correlation list
pair correlation and correlation profile


Correlation list

The tab 'Correlation list' allows the user to inspect the Pearson correlation coefficients between one selected molecular descriptor and all the others. Once the variable of interest has been selected, three lists of variables are provided. The first list from the left includes all the variables whose correlation coefficient is smaller than the negative threshold (strong negative correlation); the second list includes all the variables that are almost orthogonal to the selected variable, i.e., with a small absolute correlation coefficient; the third list includes all the variables with a strong positive correlation. The lists are filled in as follows:

1. First, the reference variable to be compared with all the others is chosen by means of the drop-down lists at the top of the tab.
2. Then, the three correlation threshold values are selected in the corresponding drop-down lists. The variable lists are filled in automatically once the threshold values have been selected.

The three thresholds proposed by Dragon are:

correlation coefficient larger than 0.80 (direct linear dependence);
correlation coefficient (absolute value) smaller than 0.10 (no linear dependence);
correlation coefficient smaller than –0.80 (inverse linear dependence).
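The splitting of variables into three lists can be illustrated with a minimal Python sketch. It is not Dragon's implementation: the function names `pearson` and `correlation_lists`, the toy descriptor values, and the handling of constant variables are assumptions made for illustration only. The default thresholds are the 0.80 and 0.10 values listed above.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # assumption: constant variables get no coefficient
    return sxy / sqrt(sxx * syy)

def correlation_lists(data, reference, high=0.80, low=0.10):
    """Split all other variables into inverse / near-orthogonal / direct lists."""
    inverse, orthogonal, direct = [], [], []
    for name, values in data.items():
        if name == reference:
            continue
        r = pearson(data[reference], values)
        if r < -high:
            inverse.append(name)       # first list: strong negative correlation
        elif abs(r) < low:
            orthogonal.append(name)    # second list: no linear dependence
        elif r > high:
            direct.append(name)        # third list: strong positive correlation
    return inverse, orthogonal, direct

# Hypothetical toy data: four descriptors calculated for five molecules.
descriptors = {
    "D1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "D2": [2.1, 3.9, 6.2, 8.0, 9.9],     # increases with D1
    "D3": [5.0, 4.1, 2.9, 2.0, 1.1],     # decreases as D1 increases
    "D4": [1.0, -1.0, 0.0, -1.0, 1.0],   # no linear relation with D1
}

inverse, orthogonal, direct = correlation_lists(descriptors, "D1")
print(inverse, orthogonal, direct)
```

Variables whose coefficient falls between the thresholds (for example 0.5) appear in none of the three lists in this sketch.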


At the top right of the window, the save shortcut exports a reduced data matrix, which includes only those molecular descriptors whose absolute correlation coefficient with the selected variable is greater than a specified value (i.e., the molecular descriptors collected in the first and the third lists). Note that this tool is mainly useful when working with molecular properties to be modelled, since it allows a preliminary screening of descriptors for QSAR modelling by selecting those molecular descriptors that give the best univariate correlations with the property of interest.
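As a rough illustration of this kind of screening outside the program, the following Python sketch keeps only the descriptors whose absolute correlation with a property exceeds a cutoff and writes them out as CSV text. The names `reduced_matrix` and `to_csv`, the property name `Y`, the toy values, the CSV layout, and the choice to leave the property itself out of the output are all assumptions; the file format actually written by Dragon is not described here.

```python
import csv
import io
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # assumption: constant variables are never kept
    return sxy / sqrt(sxx * syy)

def reduced_matrix(data, reference, cutoff=0.80):
    """Keep the variables with |r| > cutoff against the reference variable."""
    return {
        name: values
        for name, values in data.items()
        if name != reference and abs(pearson(data[reference], values)) > cutoff
    }

def to_csv(matrix):
    """Render the matrix as CSV text: one column per variable, one row per molecule."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    names = list(matrix)
    writer.writerow(names)
    writer.writerows(zip(*(matrix[name] for name in names)))
    return buffer.getvalue()

# Hypothetical toy data: one experimental property (Y) and three descriptors.
data = {
    "Y":  [1.0, 2.0, 3.0, 4.0, 5.0],
    "D1": [2.1, 3.9, 6.2, 8.0, 9.9],     # strong positive correlation with Y
    "D2": [5.0, 4.1, 2.9, 2.0, 1.1],     # strong negative correlation with Y
    "D3": [1.0, -1.0, 0.0, -1.0, 1.0],   # uncorrelated with Y
}

reduced = reduced_matrix(data, "Y")
print(to_csv(reduced))
```

Because the filter uses the absolute value of the coefficient, both directly and inversely correlated descriptors are retained.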




Pair correlation and correlation profile

The section 'Pair correlation' is dedicated to the graphical analysis of the data distribution in the space defined by two selected variables. It allows the user to calculate the Pearson correlation coefficient between two selected molecular descriptors, or between one descriptor and one user-defined property. Variables can be selected among the calculated molecular descriptors or the loaded external variables, if any. This tool may be useful for a preliminary analysis of variables, to discover interesting relationships between variables and to identify outliers among the molecules.


A 2D scatterplot is generated by default. To generate it, two variables must be selected through the four drop-down lists: for each variable, one list selects the block it belongs to and the other selects the variable itself, which makes the search for the variables of interest easier. Once the pair of variables has been selected, their Pearson correlation coefficient is shown together with the total number of molecules with missing values.
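The calculation reported for a pair of variables can be sketched in Python as follows. Two points are assumptions rather than documented behaviour: missing values are represented here by `None`, and molecules with a missing value in either variable are left out of the coefficient (pairwise deletion). The function name `pair_correlation` and the toy values are hypothetical.

```python
from math import sqrt

def pair_correlation(x, y):
    """Return (Pearson r, number of molecules with a missing value).

    Assumption: a molecule is skipped when either of its two values is None.
    """
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n_missing = len(x) - len(pairs)
    if len(pairs) < 2:
        return float("nan"), n_missing
    n = len(pairs)
    mx = sum(a for a, _ in pairs) / n
    my = sum(b for _, b in pairs) / n
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    sxx = sum((a - mx) ** 2 for a, _ in pairs)
    syy = sum((b - my) ** 2 for _, b in pairs)
    if sxx == 0 or syy == 0:
        return float("nan"), n_missing
    return sxy / sqrt(sxx * syy), n_missing

# Hypothetical toy data: one descriptor and one property for five molecules.
descriptor = [1.0, 2.0, None, 4.0, 5.0]
prop = [2.0, 4.0, 6.0, None, 10.0]

r, n_missing = pair_correlation(descriptor, prop)
print(f"r = {r:.3f}, molecules with missing values: {n_missing}")
```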


Some chart options can be accessed from the pop-up menu displayed by right-clicking on the plot. Double-clicking on a data point opens the structure view window. The molecule names can also be displayed by placing the mouse pointer on the points in the scatterplot.


If the user has checked the checkbox 'View correlation profile' in the pop-up menu displayed by right-clicking, a bar chart is provided showing the correlation coefficients between a selected descriptor and all the others. In the bar chart, every bar ends at the value of the correlation coefficient between the selected variable and another one. To generate the correlation profile, just one variable has to be selected by means of the first two drop-down lists (one for block selection and one for the single variable).


Additional options for the 'Correlation profile' plot:

show/hide the correlation lines that represent the correlation thresholds set in the 'Correlation list' tab;
sort the correlation values in the bar plot in three different ways (original order, ascending, descending).
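The data behind such a profile, and the three sort orders, can be sketched in Python as follows. The function name `correlation_profile`, the `order` keyword, and the toy descriptor values are assumptions made for illustration; the sketch only computes the bar values and does not draw the chart.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def correlation_profile(data, reference, order="original"):
    """Return (variable, r) pairs: one bar per variable other than the reference."""
    profile = [
        (name, pearson(data[reference], values))
        for name, values in data.items()
        if name != reference
    ]
    if order == "ascending":
        profile.sort(key=lambda item: item[1])
    elif order == "descending":
        profile.sort(key=lambda item: item[1], reverse=True)
    elif order != "original":
        raise ValueError(f"unknown order: {order!r}")
    return profile

# Hypothetical toy data: four descriptors calculated for five molecules.
descriptors = {
    "D1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "D2": [2.1, 3.9, 6.2, 8.0, 9.9],
    "D3": [5.0, 4.1, 2.9, 2.0, 1.1],
    "D4": [1.0, -1.0, 0.0, -1.0, 1.0],
}

for name, r in correlation_profile(descriptors, "D1", order="descending"):
    print(f"{name}: {r:+.3f}")
```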


When the correlation profile is displayed, the options available for bar marks are: none (default), Variable ID (of the variable the correlation coefficient refers to), and Correlation value (i.e., the value of the correlation coefficient). When the scatterplot is displayed, the options for point labels are: none (default), Molecule ID (i.e., the numerical molecule identifier), and Molecule name.