Consensus indel call sets
=========================

From a variety of indel calls made by several groups, four sets of indel
calls were produced for the trio children NA12878 (CEPH) and NA19240 (YRI):

  NA12878.consensus.indelcalls
  NA12878.union.indelcalls
  NA19240.consensus.indelcalls
  NA19240.union.indelcalls

Based on a validation experiment, these are the estimated overall and novel
(non-dbSNP) false discovery rates (FDR) for each of the call sets.
(Estimated errors are 1 std. dev.)

  Call set:           total:   FDR (full set):  FDR (novel):  in dbSNP129:
  NA12878 consensus   178877   0.5 +- 0.5%      6 +- 5%       140846 (78.7%)
  NA19240 consensus   148927   0 +- 1%          0 +- 4%       105838 (71.1%)
  NA12878 union       550475   20%              62%           280541 (51.0%)
  NA19240 union       637419   12%              41%           345510 (54.2%)


Generation of the union and consensus sets
==========================================

The "union" sets list all indels that have been called for an individual.
Indels that were called by several groups have been merged into one record.
Two calls are considered to be the same indel if their starting positions
are within 10 bp of each other.

The "consensus" sets list all indels that were called by the Sanger, Oxford
and Boston groups in the case of NA12878, and by the AB and Sanger groups in
the case of NA19240. The same criterion for identical calls was used.
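
The merging criterion above can be illustrated with a minimal sketch. This is
not the procedure actually used to build the call sets; it only shows one
possible reading of "starting positions within 10 bp". The names Call,
merge_calls, union_set and consensus_set are hypothetical, "within 10 bp" is
assumed to be inclusive, and calls are assumed to be clustered greedily
against the first call of each cluster.

```python
from collections import namedtuple

# Hypothetical minimal representation of a primary call.
Call = namedtuple("Call", ["chrom", "start", "group"])

MAX_DISTANCE = 10  # bp, from the merging criterion in the text


def merge_calls(calls, max_distance=MAX_DISTANCE):
    """Group calls into clusters of (assumed) identical indels.

    Assumption: a call joins the current cluster if it is on the same
    chromosome and its start lies within max_distance (inclusive) of the
    first call in that cluster.
    """
    clusters = []
    for call in sorted(calls, key=lambda c: (c.chrom, c.start)):
        last = clusters[-1] if clusters else None
        if (last is not None
                and last[0].chrom == call.chrom
                and call.start - last[0].start <= max_distance):
            last.append(call)
        else:
            clusters.append([call])
    return clusters


def union_set(calls):
    """Every merged record, regardless of which groups called it."""
    return merge_calls(calls)


def consensus_set(calls, required_groups):
    """Merged records that contain a call from every required group."""
    required = set(required_groups)
    return [cluster for cluster in merge_calls(calls)
            if required <= {c.group for c in cluster}]
```

For example, calls at chr1:100 (Sanger), chr1:104 (Oxford) and chr1:109
(Boston) would be merged into one record under these assumptions, and that
record would pass consensus_set(calls, ["Sanger", "Oxford", "Boston"]).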

File format
===========

A representative indel call is given below (line broken in two for
readability):

  chr1 897034 897036 -1 G Het NA12878 Consensus 3 G 5 AC
  GAAGGGGGTGTGCACTCACGGTAACCTTCAGTCACTGAGGAACAAACACA GGGCCCTCCCCATGGTTCACCCGGCCCAACTTCTTCTCTGGGGACCCCAA

The columns contain the following data:

   1  chromosome name
   2  leftmost possible position of indel; 1-based
   3  rightmost possible position
   4  indel size; positive = insertion; negative = deletion; 0 if unknown
   5  sequence of indel; * if unknown
   6  genotype (Hom, Het, or U for unknown/not called/no consensus available)
   7  individual
   8  call set
   9  homopolymer run length
  10  homopolymer nucleotide
  11  microsatellite run length (repeat unit length <= 5)
  12  microsatellite repeat unit
  13  reference sequence 5' of indel site; an insertion starts right after
      this sequence
  14  reference sequence 3' of indel site; this includes the deleted
      sequence (column 5) if appropriate

The position (column 2) refers to the first nucleotide of the sequence in
column 14; this is the position at which the first nucleotide of the indel
sequence (column 5) is inserted into the reference genome, or the first
nucleotide of the reference that was deleted by this indel. Note that this
definition does not necessarily agree with various other conventions used in
the primary call sets, nor does it agree with the definition used in the
GLFv3 format.

The indel size (column 4) is unknown only for a subset of dbSNP indels; this
happens where it is unclear which is the reference allele, and therefore
whether the indel is an insertion or a deletion.

Some primary call sets do not provide the indel sequence; an asterisk is
listed as the indel sequence (column 5) when the indel sequence is not
available for this reason.
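
The column layout described in this section can be read with a short parser.
The following is a sketch only: the names IndelCall and parse_indel_line are
hypothetical, and the fields are assumed to be separated by whitespace with
no spaces inside any field (the delimiter is not stated in this document).

```python
from typing import NamedTuple


class IndelCall(NamedTuple):
    """One record, with fields in the column order listed in this section."""
    chrom: str            # column 1
    left_pos: int         # column 2, 1-based
    right_pos: int        # column 3
    size: int             # column 4, 0 if unknown
    sequence: str         # column 5, "*" if unknown
    genotype: str         # column 6: Hom, Het or U
    individual: str       # column 7
    call_set: str         # column 8
    homopolymer_len: int  # column 9
    homopolymer_nt: str   # column 10
    microsat_len: int     # column 11
    microsat_unit: str    # column 12
    ref_5prime: str       # column 13
    ref_3prime: str       # column 14


def parse_indel_line(line):
    """Parse one record; assumes whitespace-separated fields."""
    f = line.split()
    if len(f) != 14:
        raise ValueError("expected 14 columns, found %d" % len(f))
    return IndelCall(
        f[0], int(f[1]), int(f[2]), int(f[3]), f[4], f[5], f[6], f[7],
        int(f[8]), f[9], int(f[10]), f[11], f[12], f[13],
    )


# The representative record from this section, joined back into one line.
EXAMPLE = (
    "chr1 897034 897036 -1 G Het NA12878 Consensus 3 G 5 AC "
    "GAAGGGGGTGTGCACTCACGGTAACCTTCAGTCACTGAGGAACAAACACA "
    "GGGCCCTCCCCATGGTTCACCCGGCCCAACTTCTTCTCTGGGGACCCCAA"
)

record = parse_indel_line(EXAMPLE)
# For a deletion with a known sequence, column 14 starts with the deleted
# bases, because column 2 points at the first nucleotide of column 14.
if record.size < 0 and record.sequence != "*":
    assert record.ref_3prime.startswith(record.sequence)
```

In the example record, the deleted G is the first base of the 3' reference
sequence, which matches the position convention described above.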

The genotype for the consensus is given as that provided by the primary call
sets, provided that all (2 or 3) sets agree; otherwise this entry is 'U'.

The microsatellite run length is the largest stretch of nucleotides around
the indel site that can be explained by a repeating k-mer (k <= 5). This is
not necessarily a multiple of k.


Disclaimer
==========

Please note the following caveats:

* The process of combining calls from different groups is still under
  discussion, and may change.

* More and updated primary call sets will become available, and when this
  happens the consensus lists will be updated correspondingly.

* The inserted sequence is currently not always provided, as this
  information was not present in all primary call sets.

* The criterion for merging indel calls is a heuristic; when primary call
  sets provide more information, this criterion will be sharpened.

* Indel positions have been normalized to the leftmost possible position
  if the indel sequence was available; hence indel calls may not be exactly
  identical to the primary calls.

* dbSNP concordance was calculated by requiring indel positions to match
  within 10 bp, and sizes to match exactly. For about 7% of dbSNP indel
  calls, the reference allele is not unambiguously determined; for these
  the indel length/type was not required to match.


Acknowledgements
================

The following people contributed indel calls, analyses, and validation data
(in no particular order):

  Richard Durbin
  Mark Gerstein
  Nancy Hansen
  Fiona Hyland
  Heng Li
  Gerton Lunter
  Gabor Marth
  Gil McVean
  Stephen Montgomery
  Jim Mullikin
  Aniko Sabo
  Michael Stromberg
  Eric Tsung
  Zhengdong Zhang