DESCRIPTION OF INTERIM ANALYSIS
===============================

An interim analysis of Phase I data was carried out based on the 2010.08.04 sequence index, which included 629 sequenced samples.

For this analysis, the data were mapped with BWA and processed through the main project alignment processing pipeline at the Sanger Institute (Sendu Bala) and at TGen (David Craig). They were then analyzed in parallel at the Broad Institute using GATK / BEAGLE (Mark De Pristo, Ryan Poplin, Eric Banks) and at the University of Michigan using glfMultiples / MACH (Hyun Min Kang). Each of these pipelines generated a list of variant sites, haplotypes and high quality genotypes for each sample (we estimate >97.5% accuracy at heterozygous sites, based on comparisons to HapMap3 genotype data and other array genotype data).

In addition, the data were mapped with MOSAIK and processed through an alternative alignment processing pipeline at NCBI (Chunlin Xiao). They were then analyzed in parallel at Boston College using FreeBayes (Al Ward) and at NCBI using glfMultiples / FreeBayes (Chunlin Xiao, Tom Blackwell, Hyun Min Kang). These two alternative pipelines currently generate lists of variant sites but not haplotypes or high quality genotypes, which require linkage-disequilibrium-aware analysis when samples are sequenced at low depth.

We derived a high quality set of variant sites by focusing on sites identified by at least 2 of these 4 analyses (Broad Institute, Michigan, Boston College and NCBI). Genotype calls use the Broad Institute genotypes by default and the Michigan genotypes when the Broad calls are not available. These are provided in a single, indexed VCF file per continental analysis panel. Some samples, in particular those from HapMap populations ASW, MXL and PUR, were deliberately included in more than one continental analysis panel and, thus, were analyzed more than once.

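
The site selection and genotype fallback rules described above can be illustrated with a small sketch. This is not the project's actual merging code. It assumes the threshold means two or more supporting analyses. The site tuples, sample identifier, genotype strings and the function names consensus_sites and pick_genotype are all made up for illustration.

```python
from collections import Counter

def consensus_sites(call_sets, min_support=2):
    """Return the sites reported by at least `min_support` call sets.

    `call_sets` maps a (hypothetical) pipeline name to an iterable of
    sites, each represented here as (chromosome, position, alt_allele).
    """
    counts = Counter()
    for sites in call_sets.values():
        counts.update(set(sites))  # each pipeline votes once per site
    return {site for site, n in counts.items() if n >= min_support}

def pick_genotype(site, sample, primary, fallback):
    """Return the primary call if present, else the fallback, else None."""
    call = primary.get((site, sample))
    if call is None:
        call = fallback.get((site, sample))
    return call

# Toy data only; none of these sites or genotypes come from the release.
call_sets = {
    "Broad": {("20", 100, "A"), ("20", 200, "T")},
    "Michigan": {("20", 100, "A"), ("20", 300, "G")},
    "BostonCollege": {("20", 200, "T")},
    "NCBI": {("20", 400, "C")},
}
sites = consensus_sites(call_sets)

broad_calls = {((("20", 100, "A")), "SAMPLE1"): "0|1"}
michigan_calls = {
    (("20", 100, "A"), "SAMPLE1"): "1|1",
    (("20", 200, "T"), "SAMPLE1"): "0|0",
}
first = pick_genotype(("20", 100, "A"), "SAMPLE1", broad_calls, michigan_calls)
second = pick_genotype(("20", 200, "T"), "SAMPLE1", broad_calls, michigan_calls)
```

In this toy input, only the two sites seen by two pipelines survive the filter. The first genotype comes from the Broad table, and the second falls back to the Michigan table because the Broad table has no entry for it.
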
GENOTYPES AND HAPLOTYPES
========================

The genotypes VCF file contains the R2 values from each of the populations where available. The specific genotypes and haplotypes called by the Broad Institute and Michigan can also be found in the supporting directory.

The Broad Institute genotypes and haplotypes are available from:

  supporting/AFR.BI_withr2.20100804.genotypes.vcf.gz
  supporting/ASN.BI_withr2.20100804.genotypes.vcf.gz
  supporting/EUR.BI_withr2.20100804.genotypes.vcf.gz

A parallel set of files including genotype and haplotype calls from Michigan is available from:

  supporting/AFR.UMICH.20100804.genotypes.vcf.gz
  supporting/ASN.UMICH.20100804.genotypes.vcf.gz
  supporting/EUR.UMICH.20100804.genotypes.vcf.gz

Each of these files includes phased haplotypes for all the individuals that originate (broadly) from the continental regions of Africa (AFR), Asia (ASN) and Europe (EUR). Individuals with admixed ancestry may be listed more than once (for example, an individual expected to have a mixture of European and African ancestry might be included in both the EUR and AFR sets).

EVALUATION OF IMPUTATION QUALITY
================================

For imputation analyses, we currently recommend using the Broad haplotypes as templates (Christian Fuchsberger, Bryan Howie). These haplotypes appear to support higher quality imputation than previous 1000 Genomes analyses and include many more markers. The table below summarizes a brief assessment of imputation quality (using the GAIN psoriasis data by Nair et al, 2009):

IMPUTATION ACCURACY (average r2 with experimental genotypes)

REFERENCE PANEL                               MAF 1-3%   MAF 3-5%   MAF >5%
1000G pilot, chromosome 20                    ~0.69      ~0.77      ~0.91
1000G August 2010, Michigan, chr. 20          ~0.61      ~0.71      ~0.91
1000G August 2010, Broad, chr. 20             ~0.73      ~0.78      ~0.92
1000G August 2010, Broad, chr. X (females)    ~0.65      ~0.74      ~0.90
1000G August 2010, Broad, chr. X (males)      ~0.69      ~0.79      ~0.90

Whereas the 1000G pilot data for the CEU and YRI samples were phased using HapMap trio genotypes, the current interim analyses of Phase I data use only population based phasing information. The main shortcoming of this approach is clear when the Michigan haplotypes are used as templates for imputation based analyses: we expect larger haplotype sets to outperform smaller sets, but in this case the pilot data perform better than the August 2010 Michigan haplotypes. The Broad haplotypes perform better and outperform the pilot project haplotypes, as expected. We are currently investigating alternative phasing approaches for the main Phase I analyses.

IMPUTATION REAGENTS
===================

For those interested in carrying out imputation based analysis using project haplotypes, input files suitable for use with IMPUTE, MACH/MINIMAC and BEAGLE are available from the websites for those programs. For convenience, links to these files are provided here:

BEAGLE
------

Website:
  http://faculty.washington.edu/browning/beagle/beagle.html

BEAGLE Format Haplotypes:
  http://faculty.washington.edu/browning/beagle/AFR.1000Genomes.20100804.beagle.zip
  http://faculty.washington.edu/browning/beagle/ASN.1000Genomes.20100804.beagle.zip
  http://faculty.washington.edu/browning/beagle/EUR.1000Genomes.20100804.beagle.zip

IMPUTE (version 2)
------------------

Website:
  https://mathgen.stats.ox.ac.uk/impute/impute_v2.html

IMPUTE Format Haplotypes:
  https://mathgen.stats.ox.ac.uk/impute/ASN.1000Genomes.Dec2010.haplotypes.tgz
  https://mathgen.stats.ox.ac.uk/impute/AFR.1000Genomes.Dec2010.haplotypes.tgz
  https://mathgen.stats.ox.ac.uk/impute/EUR.1000Genomes.Dec2010.haplotypes.tgz

MACH / MINIMAC
--------------

Website:
  http://www.sph.umich.edu/csg/abecasis/MaCH
  http://genome.sph.umich.edu/wiki/minimac

MACH Format Haplotypes:
  ftp://share.sph.umich.edu/1000genomes/fullProject/2010.08.04.WG.haplotypes/ASN.BI_inMerged2of4intersection_20100804.tgz
  ftp://share.sph.umich.edu/1000genomes/fullProject/2010.08.04.WG.haplotypes/AFR.BI_inMerged2of4intersection_20100804.tgz
  ftp://share.sph.umich.edu/1000genomes/fullProject/2010.08.04.WG.haplotypes/EUR.BI_inMerged2of4intersection_20100804.tgz

THIS FILE
=========

This file was prepared by Goncalo Abecasis (goncalo@umich.edu) with input from the 1000 Genomes Project Analysis Group.