# Mixture modelling

Mixture models identify clusters of samples by modelling marker expression as a mixture, or weighted sum, of Gaussian (normal) distributions. The Expectation-Maximisation (EM) algorithm [Dempster et al. 1977 J. Roy. Stat. Soc. B Met. 39:1] is used to fit nine models, containing from one to nine clusters. The best-performing model is selected using the Bayesian Information Criterion (BIC) [Schwarz 1978 Ann. Statist. 6:461]. The selected model is plotted as a line, together with the score density (a line computed by kernel density estimation, or KDE) and a histogram of the scores.
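
The fitting and selection procedure described above can be sketched in a few lines of Python. This is a minimal illustration and not the code behind this page: the function names, the chunk-based initialisation, the fixed iteration count, the per-Gaussian standard deviations and the lower-is-better form of the BIC are all assumptions made for the example.

```python
import math
import statistics


def normal_pdf(x, mu, sigma):
    """Density at x of a Gaussian with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def mixture_log_likelihood(xs, weights, means, sigmas):
    """Log-likelihood of the scores under the weighted sum of Gaussians."""
    total = 0.0
    for x in xs:
        density = sum(w * normal_pdf(x, m, s)
                      for w, m, s in zip(weights, means, sigmas))
        total += math.log(max(density, 1e-300))  # guard against log(0)
    return total


def fit_mixture_em(scores, k, iterations=100, min_sigma=1e-3):
    """Fit a k-Gaussian mixture to 1-D scores with a plain EM loop."""
    xs = sorted(scores)
    n = len(xs)
    # Assumed initialisation: one Gaussian per equal-sized chunk of sorted scores.
    chunks = [xs[i * n // k:(i + 1) * n // k] for i in range(k)]
    weights = [1.0 / k] * k
    means = [sum(c) / len(c) for c in chunks]
    sigmas = [max(statistics.pstdev(xs) / k, min_sigma)] * k
    for _ in range(iterations):
        # E-step: probability that each score came from each Gaussian.
        resp = []
        for x in xs:
            dens = [w * normal_pdf(x, m, s)
                    for w, m, s in zip(weights, means, sigmas)]
            total = max(sum(dens), 1e-300)
            resp.append([d / total for d in dens])
        # M-step: re-estimate the proportion, mean and sd of each Gaussian.
        for j in range(k):
            nj = max(sum(r[j] for r in resp), 1e-12)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            sigmas[j] = max(math.sqrt(var), min_sigma)
    log_lik = mixture_log_likelihood(xs, weights, means, sigmas)
    return weights, means, sigmas, log_lik


def bic(log_lik, k, n):
    """Schwarz criterion in its lower-is-better form (an assumed convention)."""
    n_params = 3 * k - 1  # k means + k standard deviations + (k - 1) proportions
    return n_params * math.log(n) - 2.0 * log_lik


def select_model(scores, max_k=9):
    """Fit models with 1..max_k Gaussians and keep the one with the lowest BIC."""
    best = None
    for k in range(1, min(max_k, len(scores)) + 1):
        weights, means, sigmas, log_lik = fit_mixture_em(scores, k)
        score = bic(log_lik, k, len(scores))
        if best is None or score < best["bic"]:
            best = {"k": k, "bic": score, "weights": weights,
                    "means": means, "sigmas": sigmas}
    return best
```

Calling `select_model(scores)` on a list of marker scores returns the chosen number of clusters together with the proportion, mean and standard deviation of each Gaussian. Some software reports the BIC with the opposite sign, in which case the largest value is selected instead.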

Available for: continuous scoring

## Viewing the results

To view mixture modelling results, click on the marker's name in the tabs near the top of the page. When a marker has been selected, you will see the mixture model plot on the left, and descriptive statistics on the right.

A Gaussian mixture model is a way of modelling marker scores using a sum of Gaussian distributions (also called normal distributions). The number of Gaussians used is called the modality of the model. The methodology is summarised at the top of this page.

The mixture model plot includes a density plot and a histogram, overlaid with a Gaussian mixture model, all computed from the same marker's scores. The centre of each cluster is shown as a dashed vertical line. Each centre point corresponds to the average expression value of the cluster (the mean, mode and median are identical for a Gaussian distribution).

The histogram and the density plot are representations of the protein expression. It is important to note that both are approximations of the underlying distribution - the histogram relies on estimating how many bars to use and where their boundaries lie, and the density line relies on kernel density estimation (KDE) using adaptive bandwidth estimation. KDE involves estimation of a parameter called bandwidth (an estimate of how smooth or irregular a distribution is). Therefore, in some cases the mixture model doesn't exactly match the KDE approximation, which could be due to a lack of precision in the KDE approximation or the mixture being a poor fit to the data, or a combination of both.
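
To illustrate how the bandwidth shapes the density line, here is a minimal sketch of a Gaussian KDE. It is a simplification: it uses a single fixed bandwidth chosen by Silverman's rule of thumb, whereas the plot on this page uses adaptive bandwidth estimation. The function names are illustrative.

```python
import math
import statistics


def silverman_bandwidth(scores):
    """Rule-of-thumb fixed bandwidth (a stand-in for adaptive estimation)."""
    n = len(scores)
    sd = statistics.stdev(scores)
    quartiles = statistics.quantiles(scores, n=4)
    iqr = quartiles[2] - quartiles[0]
    spread = min(sd, iqr / 1.34) if iqr > 0 else sd
    return 0.9 * spread * n ** (-0.2)


def kde(scores, x, bandwidth):
    """Estimated density at x: the average of one Gaussian bump per score."""
    total = sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in scores)
    return total / (len(scores) * bandwidth * math.sqrt(2.0 * math.pi))
```

A small bandwidth gives a spiky line that follows individual scores, and a large bandwidth gives a smooth line that can hide real structure. This is one reason why the density line and the mixture model may disagree.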

The dark blue dashed line running from left to right indicates the mixture model, which is computed by summing the Gaussian distributions that contribute to the model, each weighted by its mixture proportion. Statistics for each of these distributions are given on the right-hand side. Together, these provide the mathematical formula for the mixture model.
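
As a sketch of how the mixture line is obtained from the listed statistics, the following evaluates a mixture as the sum of its Gaussians, each scaled by its mixture proportion. The two-Gaussian model shown is hypothetical and the function names are illustrative.

```python
import math


def gaussian_pdf(x, mu, sigma):
    """Density at x of a Gaussian with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def mixture_pdf(x, components):
    """Height of the mixture line at x.

    components is a list of (proportion, mu, sigma) tuples, one per Gaussian.
    """
    return sum(p * gaussian_pdf(x, mu, sigma) for p, mu, sigma in components)


# Hypothetical model: 0.4 x G(mu=50, sigma=10) + 0.6 x G(mu=120, sigma=20)
MODEL = [(0.4, 50.0, 10.0), (0.6, 120.0, 20.0)]
```

Evaluating `mixture_pdf` over a grid of score values and joining the points reproduces a curve of the kind drawn on the plot. Because the proportions sum to one, the area under the curve is also one.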

The buttons on the right-hand side are:

• Open in new window - Open the current plot in a new window. Useful for comparing multiple plots side-by-side.
• Download groups (.tsv) - Download the list of samples for the current marker along with a list of scores, group assignments and group probabilities in tab-separated value (.tsv) format.
The columns in the file are:
• Sample - The sample identifier as uploaded with the TMA data
• Score - The marker score value for this sample
• MixtureGroup - The model group (i.e. the Gaussian) that the sample most likely comes from, determined by maximum likelihood.
• GroupNProb - The probability of the sample belonging to group N for each of the groups, e.g. a two Gaussian model contains columns Group1Prob and Group2Prob.
• SilhouetteWidth - The silhouette width is a measure of clustering quality on a scale from -1 to 1. Values close to 1 indicate a very well clustered sample, values around 0 mean the sample lies between two clusters, and strongly negative values (close to -1) indicate a poorly clustered sample. These values will be marked NA (not applicable) for single cluster datasets.
• Download as SVG image - Download the current image in scalable vector graphic (SVG) format. SVG images can be rescaled to any size without loss of quality, ideal for posters and publications.
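
The downloaded .tsv file can be read with any tool that handles tab-separated values. As a minimal sketch, the Python below parses the columns listed above and flags samples whose group assignment is uncertain. The example rows, the integer coding of MixtureGroup, the 0.7 threshold and the function names are all assumptions made for illustration; real files will differ.

```python
import csv
import io

# Hypothetical file contents for a two-Gaussian model.
EXAMPLE_TSV = (
    "Sample\tScore\tMixtureGroup\tGroup1Prob\tGroup2Prob\tSilhouetteWidth\n"
    "S001\t85.0\t1\t0.97\t0.03\t0.81\n"
    "S002\t140.0\t2\t0.42\t0.58\t0.05\n"
    "S003\t210.0\t2\t0.01\t0.99\t0.77\n"
)


def load_groups(handle):
    """Parse the groups file into a list of dictionaries, one per sample."""
    rows = []
    for row in csv.DictReader(handle, delimiter="\t"):
        # Collect Group1Prob, Group2Prob, ... in numeric order.
        prob_cols = sorted(
            (c for c in row if c.startswith("Group") and c.endswith("Prob")),
            key=lambda c: int(c[len("Group"):-len("Prob")]),
        )
        width = row["SilhouetteWidth"]
        rows.append({
            "sample": row["Sample"],
            "score": float(row["Score"]),
            "group": int(row["MixtureGroup"]),  # assumes groups are coded 1, 2, ...
            "probs": [float(row[c]) for c in prob_cols],
            "silhouette": None if width == "NA" else float(width),
        })
    return rows


def ambiguous_samples(rows, threshold=0.7):
    """Samples whose highest group probability falls below the threshold."""
    return [r["sample"] for r in rows if max(r["probs"]) < threshold]
```

For a real download, replace the in-memory example with `open("groups.tsv", newline="")`, where the file name is whatever the file was saved as.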

Statistics are provided in the box on the right-hand side as follows:

• Number of clusters (modality) - The number of Gaussian distributions summed together to produce the model.
• Mean silhouette width - The mean silhouette width provides a measure of clustering quality on a scale from -1 to 1. Large values (close to 1) indicate better clustering; values close to 0 or negative values indicate poor clustering. This value will be marked NA (not applicable) for single cluster datasets.
• For each Gaussian: - For each Gaussian in the model (denoted by a G) the parameters are listed. To take an example, in the model:
0.23 x G(μ=88.51, σ=22.10)
The values (parameters) are:
• The mixture proportion - How much this Gaussian contributes to the model (0.23 in the example). The mixture proportions of all the Gaussians sum to one, allowing for rounding error.
• μ - The mean/mode/median - The centre of the cluster (Gaussian), the same as the mode and median. This value is 88.51 in the above equation.
• σ - The standard deviation - The standard deviation of the Gaussian distribution. This is a measure of the width of the Gaussian's 'bell curve' (22.10 in the above equation).
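
The silhouette width statistics mentioned above can be sketched as follows. This is the standard silhouette definition applied to one-dimensional scores, assuming the absolute difference between scores as the distance measure; the distance actually used by this page is not stated, and the function names are illustrative.

```python
def silhouette_widths(scores, groups):
    """Silhouette width per sample, or None (NA) when there is only one cluster."""
    labels = sorted(set(groups))
    if len(labels) < 2:
        return [None] * len(scores)
    members = {g: [s for s, sg in zip(scores, groups) if sg == g] for g in labels}
    widths = []
    for x, g in zip(scores, groups):
        own = members[g]
        if len(own) == 1:
            widths.append(0.0)  # convention for a cluster containing one sample
            continue
        # a: mean distance to the other samples in the same cluster.
        a = sum(abs(x - y) for y in own) / (len(own) - 1)
        # b: mean distance to the samples of the nearest other cluster.
        b = min(sum(abs(x - y) for y in members[h]) / len(members[h])
                for h in labels if h != g)
        widths.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return widths


def mean_silhouette_width(scores, groups):
    """Average of the per-sample widths, or None for a single-cluster dataset."""
    widths = silhouette_widths(scores, groups)
    if widths[0] is None:
        return None
    return sum(widths) / len(widths)
```

For example, the scores 1, 2, 3 in one group and 11, 12, 13 in another form two tight, well-separated clusters, so every sample has a silhouette width above 0.8.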