This readme describes the alignment data available on the ftp site, how it is processed and what summary information is available for each alignment. A. File format and alignment index All alignment data is in the BAM format. This is the binary version of the SAM format which is described here: http://samtools.sourceforge.net/. Each BAM file has two associated files: a index file which has the same name but ends with .bai and a statistics file which has the same name but end with .bas. The format of bas file is described further down in in section D. The alignments are found under data/XXXXXXXX/alignment and data/XXXXXXX/exome_alignment where the XXXXXXXX identifier is the sample name. There is an alignment.index and an exome.alignment.index which links the alignment bam files together with their matching bai file and bas file and gives md5sums for each. The columns are as follows: 1. bam file 2. MD5 3. bai index file 4. bai index file md5 5. bas statistics file 6. bas statistic file md5 Directory alignment_indices/ contains the most current alignment.index (the one with the latest date and identical to the alignment.index one level up), as well as alignment.index files of previous BAM releases. The directory also contains the following files about BAM statistics: yyyymmdd.alignment.index.bas.gz - a collective bas file of all BAM files. yyyymmdd_yyyymmdd.alignment_stats.csv - a summary statistics of BAMs in any two releases as specified by the dates in the file name; a comparison of the two releases is captured in the "diff" values of the file. The file contains the following information break down by platform. 1. mapped basepairs in Gb 2. # of individuals in the release 3. # of individuals with > 10 Gb of mapped sequences mapped basepairs in Gb is also shown as break down by population at the bottom of the file. 20091216.alignment_stats.csv - a summary statistics of the very first release of main project BAMs. The exome alignments also have a HsMetrics files which contain the results from the picard tool CalculateHsMetrics The bam filenames themselves contain a lot of information, e.g: NA12878.mapped.ILLUMINA.bwa.CEU.low_coverage.20121211.bam The name can be broken down into 7 pieces: 1. Sample name: this matches column 10 in the sequence.index file and represents the individual all the sequences belong to. 2. Region, this is generally either mapped or unmapped, the mapped files represents all the mapping to the reference genome, the unmapped file represents all the unmapped reads. You may also see chromN files which represent mappings to just that chromosome. 3. Sequencing platform: all our current release should be ILLUMINA but old alignment files will have SOLID or LS454. 4. Mapping algorithm: For ILLUMINA this is bwa, again older files may have bfast for solid or ssaha or smalt for 454. 5. Population*: the Sample population 3 letter code, this code is defined in README.populations 6. Analysis group*: this describes the sequencing strategy used for the sequence being aligned, those strategies being 'low coverage', 'high coverage', 'exon targeted' and 'exome'. 7. Date in the format YYYYMMDD: this should match the date from the sequence.index file which was used to produce the alignments. This date will also be in the alignment index name. Over time one release of alignments will contain multiple index dates. The index itself is named for the most recent sequence.index it is based on. The index in the bam file name reflects the last sequence index new data was added for that individual *Note: For historical reason, in the first release of BAM files, fields 5 and 6 are replaced with a single column of project name such as SRP000033. One example is NA20828.chrom7.ILLUMINA.bwa.SRP000033.20091216.bam B. Alignment Process All the most recent alignments were produced by Richard Durbin's group at the Sanger based on our analysis.sequence.index file which contains all our ILLUMINA sequence data which has 70bp or longer reads. Older alignment releases also contain SOLID and 454 sequence data and their alignment process is explained in the readme which sits with those alignments Illumina data was aligned with bwa v0.5.9 in 4 steps: 1. Index the reference fasta: bwa index -a bwtsw $reference_fasta 2. For each fastq file: bwa aln -q 15 -f $sai_file $reference_fasta $fastq_file 3. Create SAM files using bwa sampe or samse for paired-end or unpaired reads respectively. For paired-end reads, the maximum insert size is taken to be 3 times the expected insert size. bwa sampe -a $max_insert_size -f $sam_file $reference_fasta $sai_files $fastq_files bwa samse -f $sam_file $reference_fasta $sai_file $fastq_file 4. Create BAM from SAM, sort, fix mate information and add the MD tag: samtools view -bSu $sam_file | samtools sort -n -o - samtools_nsort_tmp | samtools fixmate /dev/stdin /dev/stdout | samtools sort -o - samtools_csort_tmp | samtools fillmd -u - $reference_fasta > $fixed_bam_file Bam Improvement The run-level alignment BAMs are improved in various ways to help increase the quality and speed of subsequent SNP calling that may be carried out on them. For Illumina BAMs the following improvements were performed: 1. Reads undergo local realignment around known Indels using GATK IndelRealigner. java $jvm_args -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R $reference_fasta -o $intervals_file -known $known_indels_file(s) java $jvm_args -jar GenomeAnalysisTK.jar -T IndelRealigner -R $reference_fasta -I $bam_file -o $realigned_bam_file -targetIntervals $intervals_file -known $known_indels_file(s) -LOD 0.4 -model KNOWNS_ONLY -compress 0 --disable_bam_indexing where the known Indels are from the following: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_mapping_resources/ALL.wgs.indels_mills_devine_hg19_leftAligned_collapsed_double_hit.indels.sites.vcf.gz ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_mapping_resources/ALL.wgs.low_coverage_vqsr.20101123.indels.sites.vcf.gz 2. Realigned BAMs have their read qualities recalibrated with GATK CountCovariates and TableRecalibration. java $jvm_args -jar GenomeAnalysisTK.jar -T CountCovariates -R $reference_fasta -I $realigned_bam_file -recalFile recal_data.csv -knownSites $known_sites_file(s) -l INFO -L '1;2;3;4;5;6;7;8;9;10;11;12;13;14;15;16;17;18;19;20;21;22;X;Y;MT' -cov ReadGroupCovariate -cov QualityScoreCovariate -cov CycleCovariate -cov DinucCovariate java $jvm_args -jar GenomeAnalysisTK.jar -T TableRecalibration -R $reference_fasta -recalFile recal_data.csv -I $realigned_bam_file -o $recalibrated_bam_file -l INFO -compress 0 --disable_bam_indexing where the known sites for recalibration are from dbSNP135: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_mapping_resources/ALL.wgs.dbsnp.build135.snps.sites.vcf.gz 3. samtools calmd is applied to the recalibrated BAMs, which fixes the NM tags and introduces BQ tags which can be used during SNP calling. samtools calmd -Erb $recalibrated_bam_file $reference_fasta > $bq_bam_file For Exome Solid BAMs, only step 2 was performed. For low coverage Solid BAMs, only steps 1 and 2 were performed. Release bam file production The improved BAMs are merged together to get the release BAM files available for download. Release BAM files therefore contain reads from multiple readgroups. Production proceeds broadly as follows: 1. Run-level BAMs have extraneous tags (OQ, XM, XG, XO) stripped from them, to reduce total file size by around 30%. 2. Tag-stripped BAMs are merged to the library level with Picard. (merging to a given level means to create a single BAM file containing all the readgroups that share a given member of that level; in this case it means a BAM will be made for each library, each library BAM containing all the reads from the run-level BAMs associated with that library). java $jvm_args -jar MergeSamFiles.jar INPUT=$tag_stripped_bam_file(s) OUTPUT=$library_level_bam VALIDATION_STRINGENCY=SILENT 3. PCR duplicates are marked in library-level BAMs using Picard MarkDuplicates. java $jvm_args -jar MarkDuplicates.jar INPUT=$library_level_bam OUTPUT=$markdup_bam_file ASSUME_SORTED=TRUE METRICS_FILE=/dev/null VALIDATION_STRINGENCY=SILENT 4. Duplicate-marked library-level BAMs are merged to the platform level. java $jvm_args -jar MergeSamFiles.jar INPUT=$markdup_bam_file(s) OUTPUT=$platform_level_bam VALIDATION_STRINGENCY=SILENT 5. Platform-level BAMs are split into chromosomes (and other, see above) BAMs, which are the release files. C. QA of the alignment data by DCC Platform-level bams for individuals are checked by a QA process by DCC before they are released to the ftp site. Here lists the QA measurements and criteria: 1. Check md5 for each file to make sure the file was not corrupted during transfer. 2. Check coverage status. For exome runs this is done by runnung calculate_HsMetrics (a function in PICARD) to evaluate EXOME sequence coverage. Samples with less than 70% 20x coverage in targetted regions are considered coverage too low. The bait file used in this calculation is ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/exome_pull_down_targets_phases1_and_2/20120518.consensus.annotation.bed For low coverage samples we use the bas files supplied with the bams. These contain the number of mapped bases and the number of duplicate bases. We calculate the non duplicated mapped coverage. This must be 3x or greater (consdering a genome size of 2.75Gb) 3. Check for excessive number of short insertions or deletions (<=6bp). If the ratio of short insertion counts to short deletion counts is greater than 5, the sample fails. The reverse ratio of short deletion counts to short insertion counts is also checked the same way. (This uses a custom script, if you would like a copy please email info@1000genomes.org) 4. Check for sample contamination using VerifyBAMId developed by Hyun Kang in University of Michigan. VerifyBamId is a software that verifies whether the reads in particular file match previously known genotypes for an individual (or group of individuals), and checks whether the reads are contaminated as a mixture of two samples. verifyBamID can detect sample contamination and swaps when external genotypes are available. When external genotypes are not available, verifyBamID still robustly detects sample swaps. Genotype data for most phase1 and 2 samples typed on OMNI chips and genotype data for most phase3 samples typed on Affymetrix chips were used in this QA process. The original OMNI genotype vcf files can be found at /nfs/1000g-archive/vol1/ftp/technical/working/20120131_omni_genotypes_and_intensities/Omni25_genotypes_2141_samples.b37.vcf.gz and Affy genotype data can be found at /nfs/1000g-archive/vol1/ftp/technical/working/20121128_corriel_p3_sample_genotypes/. We divided the vcf files by sample population when ran VerifyBamId. The CHIP_MIX and FREE_MIX cutoff ranges from 2-3.5% for different BAM releases. http://genome.sph.umich.edu/wiki/VerifyBamID 5. Check for completeness of data against sequence index: a. All reads and runs for an individual should be included in corresponding BAM files b. All reads and runs from a BAM file should be from the same individual It is important to note that the 20130502 bam release was unusual as rather than all QC passed bams did not make it into the data/XXXXXX/alignment directories For the final round of analysis for the project we only used ILLUMINA sequence data which was 70bp or greater. The group decided for the final variant calling only samples with both low coverage and exome sequence would be considered. This means there were a small number of samples with just exome or just low coverage alignments. The sequence data is still included in the sequence.index file and while the alignments are not being used in our variant calling they are still available to be downloaded. ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/phase3_EX_or_LC_only_alignment/ These bam files have their own alignment index which you can find in this directory. D. bas files .bas files are bam statistic files, one line per readgroup, columns separated by tabs. The first line is a header that describes each column. The first 6 columns provide meta information about each readgroup. The remaining columns provide various statistics about the readgroup, calculated by going through the release bams. Where data isn't available to calculate the result for a column the default value will be 0. Each column is described in detail below. Column 1 'bam_filename': The DCC bam file name in which the readgroup data can be found. Column 2 'md5': The md5 checksum of the bam file named in column 1. Column 3 'study': The SRA study id this readgroup belongs to. Column 4 'sample': The sample (individual) identifier the readgroup came from. Column 5 'platform': The sequencing platform (technology) used to sequence the readgroup. Column 6 'library': The name of the library used for the readgroup. Column 7 'readgroup': The readgroup identifier. This is unique per .bas file. The remaining columns summarise data for reads with this RG tag in the bam file given in column 1. Column 8 '#_total_bases': The sum of the length of all reads in this readgroup. Column 9 '#_mapped_bases': The sum of the length of all reads in this readgroup that did not have flag 4 (== unmapped). Column 10 '#_total_reads': The total number of reads in this readgroup. Column 11 '#_mapped_reads': The total number of reads in this readgroup that did not have flag 4 (== unmapped). Column 12 '#_mapped_reads_paired_in_sequencing': As for column 10, but also requiring flag 1 (== reads paired in sequecing). Column 13 '#_mapped_reads_properly_paired': As for column 10, but also requiring flag 2 (== mapped in a proper pair, inferred during alignment). Column 14 '%_of_mismatched_bases': Calculated by summing the read lengths of all reads in this readgroup that have an NM tag, summing the edit distances obtained from the NM tags, and getting the percentage of the latter out of the former to 2 decimal places. Column 15 'average_quality_of_mapped_bases': The mean of all the base qualities of the bases counted for column 8, to 2 decimal places. Column 16 'mean_insert_size': The mean of all insert sizes (ISIZE field) greater than 0 for properly paired reads (as counted in column 12) and with a mapping quality (MAPQ field) greater than 0. Rounded to the nearest whole number. Column 17 'insert_size_sd': The standard deviation from the mean of insert sizes considered for column 15. To 2 decimal places. Column 18 'median_insert_size': The median insert size, using the same set of insert sizes considered for column 15. Column 19 'insert_size_median_absolute_deviation': The median absolute deviation of the column 17 data. Column 20 '#_duplicate_reads': The number of reads which were marked as duplicates Column 21' #_duplicate_bases': The number of bases which were narked as duplicated