description |
Background: Enrichment of biologically interesting loci by DNA hybridization followed by high-throughput sequencing has become an important tool in modern genetics, especially for finding disease-causing mutations. Currently, the most common capture target is the Consensus CDS (CCDS). The CCDS, however, excludes many actual or computationally predicted coding exons present in other databases, such as RefSeq and Vega, as well as non-coding functional elements such as untranslated and regulatory regions. The dynamics of capture sequencing outside the CCDS regions are consequently less well understood. Results: We examine capture sequence data outside the CCDS regions and find that extremes of GC content in different subregions of the genome can reduce local coverage to less than 50% of that in the CCDS. Further, we show that while this effect is primarily due to biases inherent in both the Illumina and SOLiD sequencing platforms, it is exacerbated by the capture process. Interestingly, for two subregion types, miRNA and predicted exons, the capture process seems to favor high relative coverage. Lastly, we examine the mutational spectrum of non-CCDS regions and find that predicted exons, as well as exonic regions specific to RefSeq and Vega, show much higher variant frequencies than the CCDS. Strikingly, predicted exons show a variant frequency of 1 per 660 bp, more than twice the rate of the CCDS and 30% higher than the overall genomic rate. Conclusions: We show that regions outside the CCDS are captured less efficiently than the CCDS itself, and that variant frequencies vary dramatically among different biologically important loci. |