Insertions and deletions (indels) are the second most common type of variation in human genomes and have been implicated in the development of cancer. Accurate indel annotation is therefore essential for variant analysis in both healthy and disease genomes. Previous studies have shown that existing indel calling methods generally produce high rates of false positives and false negatives, which limits downstream investigation of the structural and functional effects of indels.
To assess the accuracy of indel calling programs, we carried out a comparative analysis of 7 general indel calling programs and 4 somatic indel calling programs, using 78 healthy samples from the 1000 Genomes Project and 30 cancer samples from The Cancer Genome Atlas (TCGA). We adopted a comprehensive and more stringent indel comparison approach, together with an efficient way of using a benchmark, to improve performance comparisons for the general indel calling programs. We found that deriving germline indels in healthy genomes by combining several indel calling tools can remove a large number of the false positive indels produced by individual programs without compromising the number of true positives. Comparing the performance of somatic indel calling programs is more complicated because a reliable and comprehensive benchmark is lacking.
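The idea of combining several indel callers can be illustrated with a minimal sketch. It is not the comparison approach used in this study: the function names (`normalize`, `consensus_indels`), the `(chrom, pos, ref, alt)` tuple layout, and the simple voting threshold `min_support` are all assumptions made for illustration. The normalization only trims shared bases and does not left-align against a reference sequence, which a real comparison would likely need.

```python
from collections import Counter


def normalize(chrom, pos, ref, alt):
    """Trim bases shared by ref and alt so equivalent records compare equal.

    Simplified on purpose: no left-alignment against a reference genome.
    """
    # Trim the shared suffix, keeping at least one base in each allele.
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Trim the shared prefix, advancing the position accordingly.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return (chrom, pos, ref, alt)


def consensus_indels(callsets, min_support=2):
    """Return indels reported by at least `min_support` tools.

    `callsets` maps a (hypothetical) tool name to its list of indel tuples.
    """
    votes = Counter()
    for calls in callsets.values():
        # Use a set so each tool votes at most once per indel.
        for key in {normalize(*call) for call in calls}:
            votes[key] += 1
    return {key for key, n in votes.items() if n >= min_support}


# Toy call sets; tool names and coordinates are made up.
example_callsets = {
    "tool_a": [("chr1", 100, "AT", "A"), ("chr2", 50, "G", "GCC")],
    "tool_b": [("chr1", 100, "ATG", "AG")],
    "tool_c": [("chr2", 50, "G", "GCC"), ("chr3", 10, "C", "CT")],
}
supported = consensus_indels(example_callsets, min_support=2)
```

In this toy input, `tool_a` and `tool_b` describe the same deletion with different padding, so normalization lets them agree, while the indel seen only by `tool_c` is dropped as unsupported.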
We further performed a comparative analysis of somatic indels in two cancer types, breast invasive carcinoma (BRCA) and lung adenocarcinoma (LUAD). We compared somatic indels in both coding regions and non-coding regulatory regions such as transcription factor binding sites (TFBSs). We used an improved algorithm to predict TFBSs in human genomes and analyzed their evolutionary and structural roles. Our comparative results indicated that, although LUAD and BRCA genomes differ, both show a higher deletion rate, coding indel rate and frame-shift indel rate. Somatic indels tend to be located in functionally important sequences, in both coding and non-coding regions. This study can serve as a first step toward future pan-cancer analyses aimed at identifying key variant markers of cancer genomes.
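The deletion rate and frame-shift indel rate mentioned above can be made concrete with a small sketch. It assumes each indel is given as a `(ref, alt)` allele pair and that every indel passed in lies within a coding region, since frame-shift status is only meaningful there. The function names `classify_indel` and `summarize` are hypothetical and do not come from the study.

```python
def classify_indel(ref, alt):
    """Label an indel as insertion/deletion and as frameshift/in-frame."""
    delta = len(alt) - len(ref)
    if delta == 0:
        raise ValueError("ref and alt have equal length; not an indel")
    kind = "insertion" if delta > 0 else "deletion"
    # A length change that is not a multiple of 3 shifts the reading frame.
    frame = "frameshift" if abs(delta) % 3 != 0 else "in-frame"
    return kind, frame


def summarize(indels):
    """Return the fraction of deletions and of frameshift indels."""
    labels = [classify_indel(ref, alt) for ref, alt in indels]
    total = len(labels)
    if total == 0:
        raise ValueError("no indels to summarize")
    return {
        "deletion_fraction": sum(k == "deletion" for k, _ in labels) / total,
        "frameshift_fraction": sum(f == "frameshift" for _, f in labels) / total,
    }


# Toy coding indels, made up for illustration.
example_indels = [("AT", "A"), ("ACGT", "A"), ("G", "GCC"), ("C", "CT")]
summary = summarize(example_indels)
```

Comparing such fractions between cohorts, for example BRCA against LUAD, would require per-sample normalization that this sketch leaves out.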