Compositional data refers to any data that represents parts of a whole, and DNA sequencing data is compositional in nature. This is because current sequencing technologies record only a sample of the sequences present rather than all of them, which means that sequencing data breaks the assumption of independence (Gloor et al., 2017). It has long been known that analysis of compositional data is challenging and can lead to spurious correlations. Moreover, DNA sequencing data is inherently noisy due to both the limitations of sequencing technology and its biological nature. Read depth, the number of sequencing reads obtained from each sample, is a known confounding factor in many studies and also plays a role in creating artifacts in this type of data. In this work, we demonstrate that read depth drives variance in four different datasets and propose a method for quantifying the artifacts that read depth generates. We use this new method to compare untransformed data, several compositionally aware transformations, and a transformation we call “lognorm” that normalizes samples by read depth in log space. Ultimately, we find that lognorm consistently produced fewer read-depth artifacts than the other transformations.
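For illustration, the sketch below shows one way a read-depth log normalization of this kind can be implemented; the rescaling by the mean read depth and the pseudocount of 1 are assumptions of this sketch rather than a fixed definition of lognorm.

```python
import numpy as np

def lognorm(counts):
    """Illustrative sketch of normalizing samples by read depth in log space.

    counts: (n_samples, n_features) array of raw sequence counts.
    Each count is divided by its sample's read depth, rescaled to the
    mean read depth across samples, then log10-transformed with a
    pseudocount of 1 so zero counts remain defined.
    """
    counts = np.asarray(counts, dtype=float)
    depths = counts.sum(axis=1, keepdims=True)  # per-sample read depth
    mean_depth = depths.mean()                  # common scale across samples
    return np.log10(counts / depths * mean_depth + 1)
```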
One way to determine the value of a data transformation is to show that it improves the performance of a machine learning classifier. We compared several common transformations to see whether they improve the accuracy of a random forest and found that lognorm consistently and significantly improved accuracy. We believe that lognorm improves accuracy by reducing read-depth artifacts, allowing the machine learning algorithm (MLA) to learn from smaller signals within the data.
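A hedged sketch of this kind of comparison is shown below; the classifier settings, cross-validation setup, and transformation names are illustrative assumptions, not the exact evaluation protocol.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def compare_transforms(counts, labels, transforms, n_splits=5):
    """Score each transformation by cross-validated random forest accuracy.

    transforms: dict mapping a name to a function that maps the raw
    count matrix to a transformed feature matrix.
    """
    scores = {}
    for name, fn in transforms.items():
        X = fn(counts)
        clf = RandomForestClassifier(n_estimators=500, random_state=0)
        scores[name] = cross_val_score(clf, X, labels, cv=n_splits).mean()
    return scores

# Example usage, comparing untransformed counts to the lognorm sketch above:
# scores = compare_transforms(counts, labels,
#                             {"raw": lambda x: x, "lognorm": lognorm})
```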