This directory contains files associated with the variant calling carried out for phase1 
of the 1000 Genomes Project, along with other ancillary files associated with the phase1 analysis.

The phase1 analysis results directory contains a number of sub directories with different content. 
These are listed here.

Ancestry Deconvolution 

This directory contains information about the local ancestry inference which has been carried 
out on the admixed populations found in the 1000 genomes phase1 samples. These are the African 
Americans (ASW), Colombians (CLM), Mexicans (MXL) and Puerto Ricans (PUR).

Consensus Call Sets 

These directories contain the consensus call sets and genotype likelihoods which were used to 
produce the final integrated release. Please note that the indel file in this directory still contains 
indels which were subsequently filtered out of our integrated data release due to validation efforts. 
These can be identified by looking at the excluded_indel_sites directory under Supporting.
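
As an illustration of how excluded indels might be removed from a consensus indel file, here is a 
minimal Python sketch. It assumes the excluded sites have already been reduced to (chromosome, position) 
pairs; the actual layout of the files in excluded_indel_sites is not described here, so loading them 
is left out. The function name drop_excluded is hypothetical. Matching on chromosome and position 
alone is a simplification and may be too coarse where several indels share a position.

```python
def drop_excluded(vcf_lines, excluded_sites):
    """Yield VCF lines, skipping records whose (CHROM, POS) is in excluded_sites.

    excluded_sites is assumed to be an iterable of (chromosome, position)
    pairs, e.g. ("1", 100). Header lines (starting with '#') are kept.
    """
    excluded = set(excluded_sites)
    for line in vcf_lines:
        if not line.strip():
            continue  # ignore blank lines
        if line.startswith("#"):
            yield line  # keep meta-information and column header lines
            continue
        # CHROM and POS are the first two tab-separated VCF columns.
        chrom, pos = line.split("\t", 2)[:2]
        if (chrom, int(pos)) not in excluded:
            yield line
```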

Experimental Validation 

This directory contains information about which sites were validated for the different variant 
types and the results of the validation processes.

Functional Annotation 

This contains two directories: annotation_sets, which contains bed and gtf files describing the 
gene and non-coding annotation which our variant sets were compared with, and 
annotation_vcfs, which contains the actual variant annotation in vcf format.

Input Call Sets 

This directory contains all the union call sets for the snps (both low coverage and exome), 
indels and deletions that make up the integrated release. The directory contains several vcf files. 
In each file, any variant whose filter column reads PASS should be part of the integrated release.
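
The PASS rule above can be sketched in Python using only the standard library. This is an 
illustrative sketch, not the method used to build the release. The function names are hypothetical, 
and the code relies only on FILTER being the seventh tab-separated column of a VCF record.

```python
import gzip


def pass_records(lines):
    """Yield VCF data lines whose FILTER column (7th field) is exactly PASS."""
    for line in lines:
        if line.startswith("#"):
            continue  # skip meta-information and column header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) > 6 and fields[6] == "PASS":
            yield line


def pass_records_from_file(path):
    """Open a plain or gzip-compressed VCF and yield its PASS records."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        yield from pass_records(handle)
```

For example, list(pass_records_from_file(path)) collects the PASS records of one file, where path 
is the location of a downloaded vcf file.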

Integrated Call Sets 

This directory contains our final variant calls for the phase1 data sets. The majority of the data 
in this directory is identical to what can be found in ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20110521 
but there are also chrY calls for snps and deletions and chrMT calls for snps found here.

shapeit2_phased_haplotypes/

This directory contains our final variant calls on the autosomes, rephased using the SHAPEIT2 algorithm from
Olivier Delaneau and Jonathan Marchini.

http://mathgen.stats.ox.ac.uk/impute/impute_v2.html

Paper

This directory contains the pdf files of the Nature paper:
An integrated map of genetic variation from 1092 human genomes
Nature 491, 56–65 (01 November 2012) doi:10.1038/nature11632

The paper is distributed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported licence.  

Please share our paper appropriately.

Supporting 

ancestral_alignments, Ancestral fasta files from a 32-way Ensembl 59 alignment based on the 
Enredo Pecan Ortheus pipeline
axiom_genotypes, Genotypes from the Affymetrix Axiom platform for 1000 genomes samples 
cryptic_relation_analysis, The results of the Cryptic Relatedness Analysis performed by Jim Nemesh 
at the Broad Institute 
excluded_indel_sites, The list of indels which were excluded from the v3 integrated variant release
exome_pull_down, The target coordinates used for both variant calling and the downstream analysis 
of the exome data 
omni_haplotypes, Genotypes from the Illumina Omni 2.5M Chip for 1000 genomes individuals 
accessible_genome_masks, Mask files defining which regions of the genome are more or less 
accessible to the next generation methods used by the 1000 Genomes Project
variant_gerp_scores, Conservation scores for all snp and indel variant sites
highly_differentiated_sites, An Excel spreadsheet listing highly differentiated sites, 
both between super populations and between sub populations within a super population 

Many of the files in this directory are VCF files. This is our vcf file naming convention:

Population.region.description.YYYYMMDD.variant_type.analysis_group.[sites|genotypes|haplotypes].vcf.gz

Population, This gives the 3 letter code for the population. If the file represents all possible
individuals in the set, ALL is used
region, This is the chromosome name. The whole genome (sometimes this is just the autosomes and chrX) is wgs.
The full exome is wex.
description, This is a string which describes the file creation/contents
YYYYMMDD, This is a date in the format year month day. This mostly represents the sequence index date that
the variant call set is based on. If the file is not based on our alignments the date should represent when
the file was created
variant_type, This describes what sort of variant the file contains: snps, indels or SVs
analysis_group, This states if the data is based on low coverage, exome or other strategies
[sites|genotypes|haplotypes], This describes whether the file contains just a sites list or additional 
genotype or haplotype information
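
The naming convention above can be split into its fields with a short Python sketch. The names 
VcfName and parse_vcf_name are hypothetical, and the file name in the usage note is made up for 
illustration. The sketch assumes a name follows the convention exactly, with all seven fields 
present and no dots inside any field except possibly the description. Names that omit a field are 
rejected with a ValueError rather than guessed at.

```python
import re
from typing import NamedTuple


class VcfName(NamedTuple):
    population: str
    region: str
    description: str
    date: str
    variant_type: str
    analysis_group: str
    content: str


CONTENT_KINDS = {"sites", "genotypes", "haplotypes"}


def parse_vcf_name(filename):
    """Split a file name that follows the convention into its seven fields."""
    suffix = ".vcf.gz"
    if not filename.endswith(suffix):
        raise ValueError(f"not a .vcf.gz file name: {filename!r}")
    parts = filename[: -len(suffix)].split(".")
    if len(parts) < 7:
        raise ValueError(f"too few fields in: {filename!r}")
    # The last four fields are date, variant_type, analysis_group and content.
    date, variant_type, analysis_group, content = parts[-4:]
    if not re.fullmatch(r"\d{8}", date):
        raise ValueError(f"expected a YYYYMMDD date, got: {date!r}")
    if content not in CONTENT_KINDS:
        raise ValueError(f"unexpected content field: {content!r}")
    # Anything between region and date is treated as the description.
    description = ".".join(parts[2:-4])
    return VcfName(parts[0], parts[1], description, date,
                   variant_type, analysis_group, content)
```

For example, parse_vcf_name("ALL.chr20.example_description.20101123.snps.low_coverage.genotypes.vcf.gz") 
returns a VcfName whose population is ALL, region is chr20 and content is genotypes.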