The 1000genomes project has two mirrored ftp sites:

ftp://ftp.1000genomes.ebi.ac.uk
and
ftp://ftp-trace.ncbi.nih.gov/1000genomes/

Both follow the same basic structure.

At the top level there are 6 directories:

data
release
sequence_indices
alignment_indices
technical
changelog_details

There is also a pilot_data directory, which holds data from the pilot study.

The top directory also contains two major index files:

ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/sequence.index
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/alignment.index

==== data ====

contains a subdirectory per individual of the main project. Each individual directory contains a series of subdirectories for different data sets such as sequence reads, sequence alignments etc.

e.g.
ftp://ftp.1000genomes.ebi.ac.uk/data/NA12878/sequence_reads/SRR003082_1.filt.fastq.gz
or
ftp://ftp-trace.ncbi.nih.gov/1000genomes/data/NA12878/sequence_reads/SRR003082_1.filt.fastq.gz

==== release ====

contains dated directories which hold the analysis result sets released on that date, plus readmes explaining how those data sets were produced.

e.g.
ftp://ftp.1000genomes.ebi.ac.uk/release/2008_12/
or
ftp://ftp-trace.ncbi.nih.gov/1000genomes/release/2008_12/

Release directories are now named after the date of the corresponding YYYYMMDD.sequence.index file. The SNP and indel calls etc. in these directories are based on alignments produced from data listed in that YYYYMMDD.sequence.index file.

For example, the directory
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/
contains the release versions of SNP and indel calls based on the
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/sequence_indices/20100804.sequence.index
file.

==== technical ====

contains subdirectories for other data sets, such as simulations, files for method development, interim data sets etc.
e.g.
ftp://ftp.1000genomes.ebi.ac.uk/technical/simulations
or
ftp://ftp-trace.ncbi.nih.gov/1000genomes/technical/simulations

WARNING: Directory: technical/working
This directory contains data that has experimental (non public release) status and is suitable for internal project use only. Please use with caution.

=== sequence_indices ===

This directory contains all previously produced sequence.index files. Each file name begins with YYYYMMDD, indicating its release date. The date appearing in the name of a main project bam file links it to the sequence.index file with the same date. The most recent file should also match
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/sequence.index

e.g.
NA10851.unmapped.ILLUMINA.bwa.CEU.low_coverage.20101123.bam
was constructed using the low_coverage sequence files for that sample listed in the file:
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/sequence_indices/20101123.sequence.index

Each sequence.index file is accompanied by two types of statistics files (.stats and .csv).

Each YYYYMMDD.sequence.index.stats file contains summary Study/Population/Center/Sample coverage statistics for the sequence data listed in that particular index file.
e.g. 20101123.sequence.index.stats

.stats files whose names include a sequencing strategy (exome, low_coverage) contain the subset of the summary information in YYYYMMDD.sequence.index.stats that relates to exome or low_coverage data only.
e.g.
20101123.sequence.index.exome.stats
20101123.sequence.index.low_coverage.stats

.csv statistics files give the incremental changes that have occurred for Population, Center and Sequencing platform between the sequence index files whose dates are listed at the beginning of the file name.
e.g.
the files
20101101_20101123.exome_stats.csv
20101101_20101123.low_coverage_stats.csv
give summary information on the differences between the data listed in the 20101101.sequence.index and the 20101123.sequence.index files.

=== alignment_indices ===

This directory contains all previously produced alignment.index files. Each file name begins with YYYYMMDD, the date of the sequence.index file that the alignments were constructed from. The date appearing in the name of a main project bam file links the data in the bam file to that listed in the date-stamped sequence.index file. The most recent file should also match
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/alignment.index

You will also find stats files:
e.g. 20101123.alignment.index.bas.gz
These contain all the .bas files for the bam files in the release, concatenated into a single file.

There are also stats files such as
20101123_20100901.alignment_stats.low_coverage.csv
This type of file contains similar information to the .csv stats files in the sequence_indices directory.

=== changelog_details ===

To keep the root-level CHANGELOG human-readable and scrollable, changes made to the contents of the ftp site are summarised in files in this directory and referenced from the root-level CHANGELOG file. These files are named to reflect when the changes occurred and what type of change they were. The types of change are restricted to 'new', 'moved', 'replacement' or 'withdrawn'.

e.g.
changelog_details_20110216_new
changelog_details_20110216_replacement
changelog_details_20110216_withdrawn
changelog_details_20110216_moved

==== pilot_data ====

This represents a frozen version of the pilot data. It contains most of the same directories as the main ftp directory, in the same form.
Some notes about this directory:

release: during the pilot project, release directories were named in the form YYYY_MM.
paper_data_sets: this contains copies of data associated with the pilot papers.

=== Index files ===

The volume of data generated by the 1000genomes project is unprecedented. To ensure all the data is easily locatable, the most up-to-date sequence and alignment files are listed in index files:

ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/sequence.index
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/sequence.index
and
ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/alignment.index
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/alignment.index

The format of these files is explained in

ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/README.sequence.index
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/README.sequence.index
and
ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/README.alignment.index
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/README.alignment.index

These index files should provide you with sufficient information to download subsets of the files by study, individual or technology. They also contain the md5s of the files.

The main project alignment file names also contain similar information, e.g.

data/NA12878/alignment/NA12878.chromY.SOLID.bfast.CEU.high_coverage.20100125.bam
data/NA12878/alignment/NA12878.chrom20.LS454.ssaha2.CEU.exon_targetted.20100311.bam
data/NA12878/alignment/NA12878.unmapped.LS454.ssaha2.CEU.exon_targetted.20100311.bam
data/NA12878/alignment/NA12878.nonchrom.ILLUMINA.bwa.CEU.exon_targetted.20100311.bam

Filename components

-The filename starts with the sample name from Coriell/HapMap.
-If the alignment has been split by chromosome there will be a chromosome name.
-The sequencing technology is next: ILLUMINA for Illumina, LS454 for 454 and SOLID for SOLiD.
-The abbreviation of the name of the aligner used (bwa, bfast etc.).
-Three letter abbreviation of the population the individual belongs to.
-The analysis group of the sequence; this reflects the sequencing strategy.
-The release date of the sequence.index file containing the list of sequence files used to construct the alignment file.

For alignment files in /ftp/pilot_data, the technology is given as SLX for Illumina, 454 for 454 and SOLID for SOLiD. The SRP is the study identifier: 31 is pilot1 (low coverage), 32 is pilot2 (high coverage) and 33 is pilot3 (gene targeted sequencing).

If the filename contains 'unmapped', the bam holds reads associated with that individual which didn't map to the reference.

Each bam file is accompanied by an index (.bai) file and also a statistics file (.bas). See
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/README.alignment_data
for a description of the .bas file contents.

The alignments are all done in comparison to the reference found in
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/

pilot_data alignments are against the NCBI Build 36 reference.
Main project alignments are against the GRCh37 reference.

##################################################################
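
The filename components described above can be illustrated with a short Python sketch. It is not part of the project's tooling. The function names parse_bam_name and sequence_index_url are hypothetical, and it assumes that the second field (chromY, unmapped, nonchrom etc.) may be absent when an alignment is not split, and that the index location follows the sequence_indices URLs quoted in this readme. It applies to main project file names only, not to files in pilot_data.

```python
import posixpath

# Technology codes used in main project bam file names, as listed in this readme.
TECHNOLOGIES = {"ILLUMINA", "LS454", "SOLID"}

# Directory holding the dated sequence.index files (EBI mirror, from this readme).
INDEX_DIR = "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/sequence_indices/"


def parse_bam_name(path):
    """Split a main project bam file name into its dot-separated components."""
    name = posixpath.basename(path)
    if not name.endswith(".bam"):
        raise ValueError(f"not a bam file name: {name}")
    fields = name[: -len(".bam")].split(".")

    # Assumption: the region field (chrom20, unmapped, nonchrom, ...) is optional,
    # so a name has either 7 fields (with region) or 6 fields (without).
    if len(fields) == 7:
        sample, region, technology, aligner, population, group, date = fields
    elif len(fields) == 6:
        region = None
        sample, technology, aligner, population, group, date = fields
    else:
        raise ValueError(f"unexpected number of fields in: {name}")

    if technology not in TECHNOLOGIES:
        raise ValueError(f"unknown technology code: {technology}")

    return {
        "sample": sample,
        "region": region,
        "technology": technology,
        "aligner": aligner,
        "population": population,
        "analysis_group": group,
        "date": date,
    }


def sequence_index_url(date):
    """Return the URL of the dated sequence.index file a bam was built from."""
    return f"{INDEX_DIR}{date}.sequence.index"


if __name__ == "__main__":
    info = parse_bam_name(
        "data/NA12878/alignment/"
        "NA12878.chromY.SOLID.bfast.CEU.high_coverage.20100125.bam"
    )
    print(info)
    print(sequence_index_url(info["date"]))
```

The date field is the link between an alignment and the sequence files it was built from, which is why the sketch derives the sequence.index location from it rather than from the sample name.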