The centered log-ratio (CLR) transformation may be the most common approach to dealing with compositional data, such as microbiome sequencing data. To help form intuition on what the CLR transformation does and how to interpret it, let’s start by taking a look at the mathematical notation.
Let’s say we have a microbiome sample which we will treat as a vector called \(\bf{x}\) of length \(D\). We’ll refer to the taxa - or more generally the elements - of this vector \(\bf{x}\) as \({x}_1, \dots, {x}_D\). Then, CLR-transforming that vector \(\bf{x}\) would look like this:
\[clr({\mathbf{x}}) = \left \lbrace \ln \left (\frac{ {x}_{1}}{G({\mathbf{x}})} \right), \dots, \ln \left (\frac{ {x}_{D}}{G({\mathbf{x}})} \right) \right \rbrace\]
Where \({G({\bf x})}\) is the geometric mean of \(\bf{x}\). Let’s go through it step by step.
You can calculate the geometric mean of a set of n numbers by multiplying them together and then taking the nth root. Just like the ‘regular’ mean, the geometric mean says something about the center of your data.
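As a minimal sketch of that definition, the R snippet below computes the geometric mean of a small made-up vector in two ways: directly (product, then nth root) and in log-space. The vector `x` and the names `gm_direct` and `gm_log` are illustrative assumptions, not part of the original text.

```r
# Hypothetical example vector (illustrative values only)
x <- c(1, 10, 100)

# Definition: multiply all n values together, then take the nth root
gm_direct <- prod(x)^(1 / length(x))

# Equivalent form in log-space: exponentiate the mean of the logs.
# This avoids overflow when multiplying many large numbers.
gm_log <- exp(mean(log(x)))

# Both approaches agree up to floating-point error
all.equal(gm_direct, gm_log)

# For comparison, the 'regular' (arithmetic) mean of the same vector
mean(x)
```

The log-space form is the one used in the rest of this sketch, since it behaves better numerically for long vectors. Both forms assume strictly positive values.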
Essentially, to get the CLR-transformed values of a vector, you take every element of that vector, divide it by the geometric mean of the entire vector, and then take the natural logarithm of the result.
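The recipe above can be sketched in a few lines of R. The sample `x` and its taxon names are made up for illustration, and the sketch assumes there are no zeros in the vector, since the logarithm of zero is not finite.

```r
# Hypothetical counts for one sample with four taxa (illustrative only)
x <- c(taxonA = 10, taxonB = 20, taxonC = 40, taxonD = 80)

# Geometric mean of the whole vector, computed in log-space
G <- exp(mean(log(x)))

# CLR: divide every element by the geometric mean, then take the natural log
x_clr <- log(x / G)

# Equivalent shortcut: subtract the mean of the logs from each log value
x_clr_alt <- log(x) - mean(log(x))
all.equal(x_clr, x_clr_alt)

# The CLR values of a sample sum to zero (up to floating-point error)
sum(x_clr)

# Multiplying the whole sample by a constant does not change its CLR values
all.equal(x_clr, log(10 * x) - mean(log(10 * x)))
```

The shortcut form is the same one used in the `apply()` calls later in this text.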
We can deduce a few things about this transformation.
Furthermore, there are a few points to keep in mind when interpreting CLR-transformed values.
The CLR-transformation is not a perfect solution for compositionality - in fact the idea of a solution to a type of data seems a little odd - but in practice the CLR-transformation tends to be a handy tool on the belt of a bioinformatician. Understanding what exactly it does will greatly improve its utility and reduce the chance of misinterpreting an analysis.
Read count data from metagenomic sequencing experiments are affected by a multiplicative, sequence-specific bias. This is because sequences from different taxa can be subjected to vastly different conditions before sequencing and may also differ in how effectively they can be bound and sequenced by modern platforms.
Often, we have no way of estimating the per-read bias. However, we do know it is multiplicative (and not additive). Therefore, we can try to remove the bias by dividing each feature by its geometric mean. This is also known as geometric mean centering.
Let’s imagine a microbiome feature table \(X\), with \(i\) samples as rows and \(j\) features as columns:
\[ X_{i,j} = \begin{pmatrix} x_{1,1} & x_{1,2} & x_{1,3} & \cdots & x_{1,j} \\ x_{2,1} & x_{2,2} & x_{2,3} & \cdots & x_{2,j} \\ x_{3,1} & x_{3,2} & x_{3,3} & \cdots & x_{3,j} \\ \vdots & \vdots &\vdots & \ddots & \vdots \\ x_{i,1} & x_{i,2} & x_{i,3} & \cdots & x_{i,j} \end{pmatrix} \]
Because we are dealing with compositional data, the values themselves are not informative, only the ratios between values (components) within a composition.
Let \(\bf{x^{*}_{j}}\) denote the \(j\)-th column of \(X\), that is, the values of feature \(j\) across all samples. We can divide each element of \(\bf{x^{*}_{j}}\) by its geometric mean \(G({\bf x^{*}_{j}})\) to geometric mean center each feature. This is different from the CLR-transformation, as we center over the features here instead of centering over the samples. In other words:
\[ centered(X_{i,j}) = \begin{pmatrix} \frac{x_{1,1}}{G({\bf x^{*}_{1}})} & \frac{x_{1,2}}{G({\bf x^{*}_{2}})} & \frac{x_{1,3}}{G({\bf x^{*}_{3}})} & \cdots & \frac{x_{1,j}}{G({\bf x^{*}_{j}})} \\ \frac{x_{2,1}}{G({\bf x^{*}_{1}})} & \frac{x_{2,2}}{G({\bf x^{*}_{2}})} & \frac{x_{2,3}}{G({\bf x^{*}_{3}})} & \cdots & \frac{x_{2,j}}{G({\bf x^{*}_{j}})} \\ \frac{x_{3,1}}{G({\bf x^{*}_{1}})} & \frac{x_{3,2}}{G({\bf x^{*}_{2}})} & \frac{x_{3,3}}{G({\bf x^{*}_{3}})} & \cdots & \frac{x_{3,j}}{G({\bf x^{*}_{j}})} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \frac{x_{i,1}}{G({\bf x^{*}_{1}})} & \frac{x_{i,2}}{G({\bf x^{*}_{2}})} & \frac{x_{i,3}}{G({\bf x^{*}_{3}})} & \cdots & \frac{x_{i,j}}{G({\bf x^{*}_{j}})} \end{pmatrix} \]
We can center microbiome data per batch to compare batches more reasonably, since this removes the batch-specific multiplicative bias. Additionally, centering may have favourable properties for summed log-ratio analysis, where we sum components together to make amalgamations.
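One way per-batch centering could look in R is sketched below. The `batch` labels, the helper `center_features()` and the simulated table are all assumptions made for illustration; real data would need its own batch annotation and some strategy for zeros, which this sketch does not handle.

```r
set.seed(1)
# Simulated counts: 20 samples (rows) x 10 features (columns); illustrative only
counts <- t(rmultinom(n = 20, size = 10000, prob = runif(10, min = 0.01, max = 1)))

# Hypothetical batch labels: first 10 samples in batch A, last 10 in batch B
batch <- rep(c("A", "B"), each = 10)

# Hypothetical helper: divide each column (feature) by its geometric mean,
# working in log-space to avoid rounding issues with small numbers
center_features <- function(m) {
  apply(m, 2, function(x) exp(log(x) - mean(log(x))))
}

# Center each batch separately, keeping the rows in their original order
cent_counts <- matrix(NA_real_, nrow = nrow(counts), ncol = ncol(counts))
for (b in unique(batch)) {
  rows <- batch == b
  cent_counts[rows, ] <- center_features(counts[rows, , drop = FALSE])
}

# Within each batch, every feature now has a geometric mean of 1
gm_A <- apply(cent_counts[batch == "A", ], 2, function(x) exp(mean(log(x))))
gm_B <- apply(cent_counts[batch == "B", ], 2, function(x) exp(mean(log(x))))
round(gm_A, 10)
round(gm_B, 10)
```

After this step the two batches share the same per-feature center, so any remaining differences between samples are no longer driven by a batch-wide multiplicative offset.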
Aitchison distance is invariant to centering by feature geometric means:
set.seed(1)
# Simulate simple microbiome data with 10 features, 20 samples and 10,000 reads per sample.
counts <- t(rmultinom(n = 20, size = 10000, prob = runif(10, min = 0.01, max = 1)))

# CLR-transform each sample (row):
counts.clr <- t(apply(
  counts, 1, FUN = function(x) { log(x) - mean(log(x)) }
))

# Alternatively, first center each feature (column).
# Here I divide by the geometric mean in log-space, to avoid rounding issues with small numbers.
cent_counts <- apply(
  counts, 2, FUN = function(x) { exp(log(x) - mean(log(x))) }
)

# And then CLR-transform as before:
cent_counts.clr <- t(apply(
  cent_counts, 1, FUN = function(x) { log(x) - mean(log(x)) }
))

# Aitchison distance (Euclidean distance on CLR values) is the same:
all.equal(
  c(dist(counts.clr, method = "euclidean")),
  c(dist(cent_counts.clr, method = "euclidean"))
)
## [1] TRUE
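This means that beta diversity based on Aitchison distance will not change. Differential abundance results on CLR-transformed values will not change either, because centering shifts the CLR values of each feature by the same constant in every sample.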
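A short derivation may help show where this invariance comes from. The symbols \(a\) (a sample), \(b\) (a feature), \(J\) (the number of features), \(m_b = \ln G({\bf x^{*}_{b}})\) and \(\bar{m} = \frac{1}{J}\sum_{k=1}^{J} m_k\) are introduced here for illustration and are not part of the notation used earlier. CLR-transforming a centered sample gives:

\[ \ln\left(\frac{x_{a,b}}{G({\bf x^{*}_{b}})}\right) - \frac{1}{J}\sum_{k=1}^{J}\ln\left(\frac{x_{a,k}}{G({\bf x^{*}_{k}})}\right) = \left(\ln x_{a,b} - \frac{1}{J}\sum_{k=1}^{J}\ln x_{a,k}\right) - \left(m_b - \bar{m}\right) \]

The first term on the right is the ordinary CLR value of \(x_{a,b}\). The second term depends only on the feature \(b\), not on the sample \(a\), so it cancels whenever two samples are subtracted from each other:

\[ clr(centered(X))_{a,b} - clr(centered(X))_{a',b} = clr(X)_{a,b} - clr(X)_{a',b} \]

Since Aitchison distance is the Euclidean distance between CLR-transformed samples, it is built entirely from these differences and is therefore unchanged by feature centering.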