ON WHOLE GENOME CLASSIFIER PERFORMANCE IN RELATION TO 16S CLASSIFIERS

September 15, 2022

Categories: Dissertation Defense Tags: Bioinformatics and Computational Biology

Doctoral Candidate Name: James Johnson
Program: Bioinformatics and Computational Biology
Defense Date and Time: September 30, 2022 – 10:00 AM
Defense Location: Zoom: https://charlotte-edu.zoom.us/j/97413134504
Committee Chair’s Name: Dr. Anthony Fodor
Committee Members: Dr. Cynthia Gibas, Dr. Richard White, Dr. Alex Dornburg, Dr. Jacelyn Rice-Boayue, Dr. Todd Steck, Dr. Anthony Fodor
Abstract:

There is little consensus in the literature as to which approach for classifying Whole Genome Shotgun (WGS) sequences is most accurate. In this defense, two of the most popular classification algorithms, Kraken2 and Metaphlan2, were examined using four publicly available datasets. Surprisingly, Kraken2 reported not only more taxa overall but also many more taxa that were significantly associated with metadata. When the Spearman correlation coefficient of each taxon in a dataset was computed against the more abundant taxa, Kraken2 showed a consistent pattern of reporting low-abundance taxa that were highly correlated with the more abundant taxa. Neither Metaphlan2 nor the 16S sequences, which were available for two of the four datasets, showed this pattern. These results suggest that Kraken2 consistently misclassified sequences from high-abundance taxa into the same erroneous low-abundance taxa. These “phantom” taxa show the same pattern of statistical inference as their high-abundance source. Because of the ever-increasing sequencing depths of modern WGS cohorts, these “phantom” taxa will appear significant in statistical models even when the classification error rate of Kraken2 is low. These findings suggest a novel metric for evaluating classifier accuracy.
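The correlation check described in the abstract can be illustrated with a minimal sketch. This is not the candidate's code or data. The function names, the choice of "maximum Spearman coefficient against any taxon with a higher mean abundance" as the summary score, and the toy counts are all assumptions made for illustration. The sketch uses only the Python standard library and computes Spearman's coefficient as the Pearson correlation of average ranks.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed inputs."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: treat a constant vector as uncorrelated
    return cov / sqrt(var_x * var_y)


def max_rho_vs_more_abundant(table):
    """For each taxon, the highest rho against any taxon with a larger mean count.

    `table` maps a taxon name to its per-sample counts (hypothetical layout).
    The most abundant taxon has nothing to compare against and is omitted.
    """
    mean = {taxon: sum(counts) / len(counts) for taxon, counts in table.items()}
    scores = {}
    for taxon, counts in table.items():
        rhos = [
            spearman(counts, other_counts)
            for other, other_counts in table.items()
            if mean[other] > mean[taxon]
        ]
        if rhos:
            scores[taxon] = max(rhos)
    return scores


# Toy data, invented for illustration: "phantom" is 1% of "abundant" in
# every sample, while "independent" varies on its own.
toy_counts = {
    "abundant": [1000, 2000, 1500, 3000, 2500],
    "phantom": [10, 20, 15, 30, 25],
    "independent": [12, 3, 25, 7, 18],
}
scores = max_rho_vs_more_abundant(toy_counts)
```

In this toy table the low-abundance taxon that tracks the abundant one scores far higher than the independent one. That is the kind of signature the abstract attributes to misclassified sequences. A real analysis would also need to account for genuine biological co-occurrence, which this sketch does not attempt.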