Consensus indel call sets
=========================


From a variety of indel calls made by several groups, four sets of indel calls were
produced for the trio children NA12878 (CEPH) and NA19240 (YRI):

NA12878.consensus.indelcalls
NA12878.union.indelcalls
NA19240.consensus.indelcalls
NA19240.union.indelcalls

A validation experiment gave the following estimated false discovery rates (FDR) for each call
set, both overall and for novel (non-dbSNP) indels.  Estimated errors are 1 standard deviation.

Call set:             total:   FDR (full set):  FDR (novel indels):     in dbSNP129:

NA12878 consensus     178877   0.5 +- 0.5%      6 +- 5%                 140846 (78.7%)
NA19240 consensus     148927   0 +- 1%          0 +- 4%                 105838 (71.1%)
NA12878 union         550475   20%              62%                     280541 (51.0%)
NA19240 union         637419   12%              41%                     345510 (54.2%)


Generation of the union and consensus sets
==========================================


The "union" sets list all indels that have been called for an individual.  Indels that were
called by several groups have been merged into one record.  The criterion for being considered
the same indel is that the starting positions were within 10 bp of each other.

The "consensus" sets list all indels that were called by the Sanger, Oxford and Boston
groups in the case of NA12878, and by the AB and Sanger groups in the case of NA19240.  The same
10 bp criterion was used to decide whether calls from different groups refer to the same indel.
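
The merging rule above can be sketched in Python as follows.  This is only an illustration:
the function and class names are hypothetical, and the text does not say how chains of nearby
calls are grouped, so the sketch assumes that a call joins the current record when its start is
within 10 bp of the previous call's start in sorted order.

```python
from dataclasses import dataclass, field

MAX_START_DISTANCE = 10  # bp, the criterion stated in the text


@dataclass
class MergedIndel:
    chrom: str
    start: int  # start of the first call placed in this record
    groups: set = field(default_factory=set)


def merge_calls(calls, max_dist=MAX_START_DISTANCE):
    """Merge (chrom, start, group) calls whose starts lie within max_dist bp.

    Assumption: calls are chained, i.e. each call is compared with the
    previous call in sorted order, not with the first call of the record.
    """
    merged = []
    last_start = None
    for chrom, start, group in sorted(calls):
        current = merged[-1] if merged else None
        if (current is not None and current.chrom == chrom
                and start - last_start <= max_dist):
            current.groups.add(group)
        else:
            merged.append(MergedIndel(chrom, start, {group}))
        last_start = start
    return merged


def consensus(merged, required_groups):
    """Keep only merged records that were called by every required group."""
    required = set(required_groups)
    return [m for m in merged if required <= m.groups]
```

Under these assumptions, the union set corresponds to the output of merge_calls, and the
consensus set to the subset returned by consensus for the groups named above.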


File format
===========


A representative indel call is given below (line broken in two for readability):


chr1    897034  897036  -1      G       Het     NA12878 Consensus       3       G       5       AC
GAAGGGGGTGTGCACTCACGGTAACCTTCAGTCACTGAGGAACAAACACA      GGGCCCTCCCCATGGTTCACCCGGCCCAACTTCTTCTCTGGGGACCCCAA


The columns contain the following data:

1 chromosome name
2 leftmost possible position of indel; 1-based
3 rightmost possible position
4 indel size; positive = insertion; negative = deletion; 0 if unknown
5 sequence of indel; * if unknown
6 genotype (Hom, Het, or U for unknown/not called/no consensus available)
7 individual
8 call set
9 homopolymer run length
10 homopolymer nucleotide
11 microsatellite run length (repeat unit length <= 5)
12 microsatellite repeat unit
13 reference sequence 5' of indel site; an insertion starts right after this sequence
14 reference sequence 3' of indel site; this includes the deleted sequence (column 5) if appropriate
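
A record can be read into named fields with a few lines of Python.  The sketch below is not
part of the release; the field names are hypothetical, and it assumes that the 14 columns are
separated by whitespace and that no column is empty.

```python
from typing import NamedTuple


class IndelCall(NamedTuple):
    chrom: str             # column 1
    left_pos: int          # column 2, 1-based
    right_pos: int         # column 3
    size: int              # column 4, 0 if unknown
    sequence: str          # column 5, '*' if unknown
    genotype: str          # column 6: Hom, Het or U
    individual: str        # column 7
    call_set: str          # column 8
    homopolymer_len: int   # column 9
    homopolymer_base: str  # column 10
    microsat_len: int      # column 11
    microsat_unit: str     # column 12
    ref_5prime: str        # column 13
    ref_3prime: str        # column 14


def parse_indel_line(line):
    """Split one record into its 14 columns and convert the numeric ones."""
    f = line.split()
    if len(f) != 14:
        raise ValueError("expected 14 columns, found %d" % len(f))
    return IndelCall(f[0], int(f[1]), int(f[2]), int(f[3]), f[4], f[5], f[6],
                     f[7], int(f[8]), f[9], int(f[10]), f[11], f[12], f[13])
```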

The position (column 2) refers to the first nucleotide of the sequence in column 14; this is the
position at which the first nucleotide of the indel sequence (column 5) is inserted into the reference
genome, or the first nucleotide of the reference that was deleted by this indel.  Note that this
definition does not necessarily agree with various other conventions used in the primary call sets,
nor does it agree with the definition used in the GLFv3 format.
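
To make the position convention concrete, the following Python sketch rebuilds the local
sequence of the indel allele from columns 4, 5, 13 and 14.  The function name is hypothetical,
and the consistency checks are an assumption about how the columns relate, based only on the
descriptions above.

```python
def alternate_sequence(size, indel_seq, ref_5prime, ref_3prime):
    """Return the flanking sequence with the indel applied.

    size and indel_seq are columns 4 and 5; ref_5prime and ref_3prime are
    columns 13 and 14.  Column 2 is the position of ref_3prime[0].
    """
    if size == 0 or indel_seq == "*":
        raise ValueError("indel size or sequence is unknown")
    if len(indel_seq) != abs(size):
        raise ValueError("indel sequence length does not match indel size")
    if size > 0:
        # Insertion: the new bases go right after the 5' flank.
        return ref_5prime + indel_seq + ref_3prime
    # Deletion: the deleted bases are the first bases of the 3' flank.
    if not ref_3prime.startswith(indel_seq):
        raise ValueError("3' flank does not start with the deleted sequence")
    return ref_5prime + ref_3prime[len(indel_seq):]
```

For the representative call shown earlier (a deletion of one G), this removes the first G of
the column 14 sequence and leaves the column 13 sequence unchanged.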

The indel size (column 4) is unknown only for a subset of dbSNP indels; this happens where it is
unclear which allele is the reference allele, and therefore whether the indel is an insertion or
a deletion.

Some primary call sets do not provide the indel sequence; an asterisk is listed as the indel sequence
(column 5) when indel sequence is not available for this reason.

The consensus genotype is the one reported by the primary call sets, provided that all (2 or 3)
sets agree; otherwise this entry is 'U'.
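
The genotype rule can be written as a small Python function.  This is a sketch with a
hypothetical name; it assumes that a primary call set without a usable genotype also leads
to 'U', which the text does not state explicitly.

```python
def consensus_genotype(genotypes):
    """Return the shared genotype if all primary call sets agree, else 'U'.

    genotypes holds one value per primary call set ('Hom', 'Het' or 'U').
    """
    distinct = set(genotypes)
    if len(distinct) == 1 and distinct <= {"Hom", "Het"}:
        return next(iter(distinct))
    return "U"
```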

The microsatellite run length is the largest stretch of nucleotides around the indel site that can be
explained by a repeating k-mer (k <= 5).  The run length is not necessarily a multiple of k.


Disclaimer
==========

Please note the following caveats:


* The process of combining calls from different groups is still under discussion, and may change.

* More and updated primary call sets will become available, and when this happens the consensus lists
 will be updated correspondingly.

* The inserted sequence is currently not always provided, as this information was not present in all primary
 call sets.

* The criterion for merging indel calls is a heuristic; when primary call sets provide more information, this
 criterion will be sharpened.

* Indel positions have been normalized to the leftmost possible position if the indel sequence was available;
 hence indel calls may not be exactly identical to the primary calls.

* dbSNP concordance was calculated by requiring indel positions to match within 10 bp, and sizes to match exactly.
 For about 7% of dbSNP indel calls, the reference allele is not unambiguously determined; for these the indel
 length/type was not required to match.
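
The dbSNP concordance test in the last caveat can be sketched in Python as below.  The
function name and arguments are hypothetical.  The sketch assumes that the compared positions
are the leftmost positions (column 2), and that a dbSNP size of 0 marks an entry whose
reference allele is ambiguous, as in column 4.

```python
def matches_dbsnp(call_pos, call_size, dbsnp_pos, dbsnp_size, max_dist=10):
    """Return True if a call is concordant with a dbSNP indel entry."""
    # Positions must agree to within max_dist bp.
    if abs(call_pos - dbsnp_pos) > max_dist:
        return False
    # Assumption: size 0 means the reference allele is ambiguous, in which
    # case neither length nor type is required to match.
    if dbsnp_size == 0:
        return True
    # Signed sizes must match exactly (same length, same indel type).
    return call_size == dbsnp_size
```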


Acknowledgements
================

The following people contributed indel calls, analyses, and validation data (in no particular order):


Richard Durbin
Mark Gerstein
Nancy Hansen
Fiona Hyland
Heng Li
Gerton Lunter
Gabor Marth
Gil McVean
Stephen Montgomery
Jim Mullikin
Aniko Sabo
Michael Stromberg
Eric Tsung
Zhengdong Zhang