Marker score data requirements

Marker scores can be input in spreadsheet format (Microsoft Excel), or alternatively in comma-separated (CSV) or tab-separated (TSV) formats. In all cases, the same tabular layout is used (see below).

Summary of main requirements:

More detailed requirements, including how to specify batches for batch correction, are given below.

Example data

The following example datasets are available for download. They can be opened using Excel or a text editor to look at the file layout, and re-uploaded if you wish to try out the upload facility.

Tab separated format (.tsv)

Excel format (.xls)

See also the associated example survival data.

Input file formats

The following summarises compatible file formats:

Input data layout

Input data needs to be structured as a table (see example below). The top left cell should be labelled 'Sample'. The first column of the table is reserved for sample identifiers of 36 characters or fewer (e.g. a patient identifier); the first row of the table specifies marker identifiers, which must be 18 characters or fewer. For maximum compatibility, please stick to using ASCII letters (A-Z; a-z), numbers (0-9), dashes (-), underscores (_) and dots (.). Please note that spaces should not be used. Every sample identifier should be unique within the file; repeated marker identifiers are treated as replicates (repeat measurements) of the same marker. These replicates are averaged on import using the mean or the median, as chosen by the user. Note that repeated marker identifiers must be exactly the same, including upper/lower case, to be recognised as replicates.
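The layout rules above can be checked before upload. The following is a minimal sketch in Python, not the checks the upload facility itself runs: the function names check_identifiers and average_replicates are hypothetical, the character rule is treated as a warning, and skipping blank cells when averaging replicates is an assumption.

```python
import re
from statistics import mean, median

# Characters recommended above: ASCII letters, digits, dash, underscore, dot.
SAFE_ID = re.compile(r"[A-Za-z0-9._-]+")


def check_identifiers(header, sample_ids):
    """Return a list of problems found; an empty list means none."""
    problems = []
    if not header or header[0] != "Sample":
        problems.append("top-left cell should be 'Sample'")
    for marker in header[1:]:
        if marker.startswith("*"):
            continue  # asterisk-prefixed columns are not marker scores
        if len(marker) > 18:
            problems.append(f"marker identifier over 18 characters: {marker!r}")
        if not SAFE_ID.fullmatch(marker):
            problems.append(f"marker identifier has unsupported characters: {marker!r}")
    seen = set()
    for sample in sample_ids:
        if len(sample) > 36:
            problems.append(f"sample identifier over 36 characters: {sample!r}")
        if not SAFE_ID.fullmatch(sample):
            problems.append(f"sample identifier has unsupported characters: {sample!r}")
        if sample in seen:
            problems.append(f"duplicate sample identifier: {sample!r}")
        seen.add(sample)
    return problems


def average_replicates(markers, scores, how="mean"):
    """Collapse identically named marker columns for one sample row.

    `markers` and `scores` are parallel lists without the Sample column.
    Blank cells are None and are skipped (an assumption).
    """
    if how not in ("mean", "median"):
        raise ValueError("how must be 'mean' or 'median'")
    combine = mean if how == "mean" else median
    grouped = {}
    for marker, score in zip(markers, scores):
        grouped.setdefault(marker, [])
        if score is not None:
            grouped[marker].append(score)
    return {m: (combine(v) if v else None) for m, v in grouped.items()}
```

Because the names are compared exactly, columns labelled 'ER' and 'er' stay separate in this sketch, which matches the case-sensitivity rule described above.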

Cells containing scores should be specified numerically, using a dot as the decimal indicator (e.g. 123.45). Scores are allowed in the range -100 up to 1 trillion (10^12). Scores are typically non-negative, but a small negative margin is allowed in case normalisation, such as previous batch correction, has been applied. Scoring systems whose values can fall anywhere within a range are referred to as continuous. An example is AQUA.
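As an illustration of the numeric rules above, here is a short Python sketch that parses one cell. The name parse_score is hypothetical, and treating both ends of the range as inclusive is an assumption; the text does not say how the boundaries are handled.

```python
def parse_score(cell):
    """Convert one cell to a float, or return None for a blank cell.

    Raises ValueError for non-numeric text, a comma decimal separator,
    or a value outside -100 to 1e12 (bounds assumed inclusive).
    """
    text = cell.strip()
    if text == "":
        return None  # missing score
    if "," in text:
        raise ValueError(f"use a dot as the decimal indicator: {text!r}")
    value = float(text)  # raises ValueError for non-numeric text
    if not (-100 <= value <= 1e12):
        raise ValueError(f"score out of range: {text!r}")
    return value
```

For example, parse_score("123.45") returns 123.45, while parse_score("123,45") raises a ValueError.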

Scores can also be categorical, i.e. only taking one of a fixed set of values. An example is the Allred scoring system. Categorical scoring systems with up to nine discrete values are supported. Datasets with more than nine different score values are currently treated as continuous. For such datasets, consider grouping score values together to produce nine or fewer different values, which will then be detected as categorical.
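The nine-value rule and the suggested grouping can be sketched as follows. This is an illustration only: the names scoring_type and group_scores are hypothetical, the cut points are made up, and counting distinct values across a whole list of scores is an assumption about how detection works.

```python
from bisect import bisect_right


def scoring_type(scores, max_categories=9):
    """Classify scores by the number of distinct non-missing values."""
    distinct = {s for s in scores if s is not None}
    return "categorical" if len(distinct) <= max_categories else "continuous"


def group_scores(scores, cut_points):
    """Replace each score by the index of the interval it falls in.

    `cut_points` must be sorted in ascending order; a score equal to a
    cut point goes into the higher group. Missing scores stay None.
    """
    return [None if s is None else bisect_right(cut_points, s) for s in scores]
```

A dataset with scores 0 to 11 has twelve distinct values and would be classed as continuous here; grouping it with cut points [3, 6] leaves three values, which would be classed as categorical.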

Please stick to one scoring system per file so that markers are comparable.

Missing scores can be left blank, although each marker must have at least ten scores specified to be validated successfully.
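To see which markers would fall short of the ten-score minimum, the non-blank cells per marker can be counted. The sketch below reads tab-separated text with Python's csv module; the function name markers_with_too_few_scores is hypothetical, and pooling the counts of replicate columns that share a name is an assumption.

```python
import csv
import io


def markers_with_too_few_scores(tsv_text, minimum=10):
    """Return the sorted names of markers with fewer than `minimum` scores."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    header = next(reader)
    # Asterisk-prefixed columns are not marker scores, so they are skipped.
    counts = {name: 0 for name in header[1:] if not name.startswith("*")}
    for row in reader:
        # zip stops at the shorter list, so trailing blanks count as missing.
        for name, cell in zip(header[1:], row[1:]):
            if name in counts and cell.strip() != "":
                counts[name] += 1
    return sorted(name for name, n in counts.items() if n < minimum)
```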

Example input data:


Advanced - datasets with batches

It is sometimes not possible to perform a complete analysis using a single TMA block. Instead, multiple blocks are used, with sections from each block placed on separate slides. Batch effects or unwanted non-biological variation can arise for many reasons, for example due to differences between TMA blocks, personnel or experiment date. Such batch effects can often be reduced using ComBat, a batch correction procedure [Johnson et al. 2007 Biostatistics 8:118].

ComBat is an empirical Bayes method designed to mitigate batch effects in gene expression microarrays. As the framework is rather general, we have implemented a version of ComBat that is tailored for Tissue Microarrays. A few non-methodological changes have been made, including improved error handling and a method for automatic removal of problematic replicates.

Please note that ComBat should only be used on continuous score data (not on categorical score data such as Allred scores). A minimum of two distinct markers (not counting replicates) is also required.

In a scenario with multiple TMA sections, the batch is the section or slide number (defined arbitrarily, for example by numbering the sections 1, 2 and 3). We can also attach experimental conditions, or covariates, which may not be evenly split between the batches. Covariates are variations that we wish to preserve, for example whether a sample is a control or subject to a test condition, or if there are different histology types.

These data are supplied along with the TMA's scores. All columns which are not scores are prefixed with an asterisk (*); any such column is only used for pre-processing the TMA during import. The batch is specified in a column called *Batch. Covariates are specified in columns prefixed *cov and are given a unique name for reference, for example *covHistologyType or *covControlOrTest. Note that at least some samples from each type of covariate must be in each batch. For example, in a setup with three batches, and a covariate specifying control or test, every batch must have at least one sample labelled control and at least one sample labelled test.
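The column naming convention above can be sketched as a small classifier over the header row. This is a minimal illustration in Python; the function name classify_columns and the returned dictionary layout are hypothetical, and collecting other asterisk-prefixed columns under "other" is an assumption.

```python
def classify_columns(header):
    """Split a header row into batch, covariate, other and marker columns."""
    result = {"batch": None, "covariates": [], "other": [], "markers": []}
    for name in header[1:]:  # header[0] is the 'Sample' column
        if name == "*Batch":
            result["batch"] = name
        elif name.startswith("*cov"):
            result["covariates"].append(name)
        elif name.startswith("*"):
            result["other"].append(name)  # non-score column of unknown purpose
        else:
            result["markers"].append(name)
    return result
```

For a header of Sample, *Batch, *covHistologyType, ER and PR, this returns *Batch as the batch column, *covHistologyType as the only covariate, and ER and PR as markers.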

Example input data with two batches and one covariate:


Note that variations in covariates must be spread across batches - in the above example, we couldn't have a batch 3 where all samples have histology 'Serous'. In such a scenario, it would be impossible to tell whether an effect is due to the batch or the histology type, and the system will generate an error.
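The spread requirement can be checked before upload by listing, for each batch, the covariate values that never occur in it. The following Python sketch assumes one batch label and one covariate value per sample; the function name missing_levels is hypothetical, and the value 'Other' in the usage note is made up for illustration.

```python
def missing_levels(batches, covariate):
    """Map each incomplete batch to the covariate values absent from it.

    `batches` and `covariate` are parallel lists with one entry per sample.
    An empty dict means every value occurs in every batch.
    """
    all_levels = set(covariate)
    seen = {}
    for batch, level in zip(batches, covariate):
        seen.setdefault(batch, set()).add(level)
    return {
        batch: sorted(all_levels - levels)
        for batch, levels in seen.items()
        if levels != all_levels
    }
```

With batches [1, 1, 2, 2, 3, 3] and histology values ['Serous', 'Other', 'Serous', 'Other', 'Serous', 'Serous'], the result is {3: ['Other']}: batch 3 contains only 'Serous' samples, so batch and histology could not be told apart there.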

ComBat requires a certain amount of data to estimate batch effects and so can fail to run when large amounts of data are missing. To help with this issue, samples with more than 80% missing data are removed prior to batch correction; an error is returned if all samples have more than 80% missing data. We have also added protein replicate filtering to the ComBat algorithm. This is essentially a recover-and-try-again procedure applied when ComBat fails to estimate the batch effects: if a particular marker/replicate appears to have caused the problem, that column is removed and ComBat is run again. This procedure may therefore remove markers in an attempt to perform batch correction successfully. It is repeated until estimation of batch effects succeeds; an error is returned if the batches become confounded with the covariates.
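The sample filtering step described above can be sketched as follows. This is an illustration rather than the implementation used here: the function name drop_sparse_samples is hypothetical, and computing the missing percentage over all score columns of a sample is an assumption.

```python
def drop_sparse_samples(rows, max_missing_percent=80):
    """Keep samples whose share of missing scores is at most the threshold.

    `rows` maps each sample identifier to its list of scores, with None
    for a missing score. Raises ValueError if no sample remains.
    """
    kept = {}
    for sample, scores in rows.items():
        missing = sum(score is None for score in scores)
        # Integer comparison avoids floating-point rounding at exactly 80%.
        if scores and 100 * missing <= max_missing_percent * len(scores):
            kept[sample] = scores
    if not kept:
        raise ValueError("all samples have too much missing data")
    return kept
```

In this sketch a sample with exactly 80% of its scores missing is kept, since only samples with more than 80% missing data are removed.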

When applying batch correction to datasets with many missing scores, the sample and replicate filtering described above may result in fewer samples and, more rarely, fewer markers in the final set of scores than in the original data. We recommend you check the numbers of samples and markers listed for the dataset in TMA Navigator and compare with the numbers in the input file. You can also download the data from the dataset page if you wish to compare it with the original data directly.
