This readme describes the alignment data available on the ftp site, how it is processed and what summary information is available for each alignment. A. File format and alignment index All alignment data is in the BAM format. This is the binary version of the SAM format which is described here: http://samtools.sourceforge.net/. Each BAM file has two associated files: a index file which has the same name but ends with .bai and a statistics file which has the same name but end with .bas. The format of bas file is described further down in in section D. The alignments are found under data/XXXXXXXX/alignment where the XXXXXXXX identifier is the sample name. There is an alignment.index which links the alignment bam files together with their matching bai file and bas file and gives md5sums for each. The columns are as follows: 1. bam file 2. MD5 3. bai index file 4. bai index file md5 5. bas statistics file 6. bas statistic file md5 Directory alignment_indices/ contains the most current alignment.index (the one with the latest date and identical to the alignment.index one level up), as well as alignment.index files of previous BAM releases. The directory also contains the following files about BAM statistics: yyyymmdd.alignment.index.bas.gz - a collective bas file of all BAM files. yyyymmdd_yyyymmdd.alignment_stats.csv - a summary statistics of BAMs in any two releases as specified by the dates in the file name; a comparison of the two releases is captured in the "diff" values of the file. The file contains the following information break down by platform. 1. mapped basepairs in Gb 2. # of individuals in the release 3. # of individuals with > 10 Gb of mapped sequences mapped basepairs in Gb is also shown as break down by population at the bottom of the file. 20091216.alignment_stats.csv - a summary statistics of the very first release of main project BAMs. The bam filenames themselves contain a lot of information, e.g: NA12878.chrom1.LS454.ssaha.CEU.high_coverage.20091216.bam The name can be broken down into 7 pieces: 1. Sample name: this matches column 10 in the sequence.index file and represents the individual all the sequences belong to. 2. Chromosome: this is the name of the chromosome that this file contains mappings to. To ensure no alignment file becomes too big we sometimes create bam files per chromosome, 1-22 and X, Y and MT. There is also a nonchrom file containing reads that mapped to non-chromosomal sequences in the reference. Finally there is an unmapped file which contains any unmapped singled ended reads and any pairs where neither read maps. If one half of a pair maps both reads can be found in the other bam files. If the original bam file was not very large, there may be no or only one chromosomal bams, and instead a 'mapped' bam that contains the opposite of the unmapped bam. In Phase I and onwards releases, bams are not split into chromosomes and are released as a whole genome alignment; these bams are "mapped" bams, complemented by the unmapped bam. A chrom20 BAM is also released. 3. Sequencing platform: this can be LS454, SOLID or ILLUMINA. 4. Mapping algorithm: the program used to run the mapping, currently this includes bwa, ssaha and bfast. 5. Population*: the Sample population 3 letter code, this code is defined in README.populations 6. Analysis group*: this describes the sequencing strategy used for the sequence being aligned, those strategies being 'low coverage', 'high coverage', 'exon targeted' and 'exome'. 7. Date in the format YYYYMMDD: this should match the date from the sequence.index file which was used to produce the alignments. This date will also be in the alignment index name. Over time alignment index files will start to contain files with multiple dates in them. When this situation arises the index file will always be named for the most recent date found. *Note: For historical reason, in the first release of BAM files, fields 5 and 6 are replaced with a single column of project name such as SRP000033. One example is NA20828.chrom7.ILLUMINA.bwa.SRP000033.20091216.bam B. Alignment Process The low coverage alignments were produced by Richard Durbin's group at the Sanger Institute (Illumina and LS454) and David Craig's group at The Translational Genomics Institute (Solid). The exome alignment data are produced by Durbin's group (Illumina) and Baylor College of Medicine (Solid). Different aligners are used for the 3 sequencing platforms, described below. Alignments are carried out on 'run level' fastqs, that is fastq files that share the same column 3 in the sequence.index (which becomes known as the 'readgroup' in the eventual released bam files). B1. Illumina Illumina data was aligned with bwa v0.5.9 in 4 steps: 1. Index the reference fasta: bwa index -a bwtsw $reference_fasta 2. For each fastq file: bwa aln -q 15 -f $sai_file $reference_fasta $fastq_file 3. Create SAM files using bwa sampe or samse for paired-end or unpaired reads respectively. For paired-end reads, the maximum insert size is taken to be 3 times the expected insert size. bwa sampe -a $max_insert_size -f $sam_file $reference_fasta $sai_files $fastq_files bwa samse -f $sam_file $reference_fasta $sai_file $fastq_file 4. Create BAM from SAM, sort, fix mate information and add the MD tag: samtools view -bSu $sam_file | samtools sort -n -o - samtools_nsort_tmp | samtools fixmate /dev/stdin /dev/stdout | samtools sort -o - samtools_csort_tmp | samtools fillmd -u - $reference_fasta > $fixed_bam_file B2. LS454 LS454 data was aligned with smalt v0.5.8 in 3 steps: 1. Index the reference fasta: smalt index -k 13 -s 4 $index_base $reference_fasta 2. Map the fastq files: smalt map -f samsoft -o $sam_file $index_base $fastq_file(s) 3. Create BAM from SAM, sort, fix mate information and add the MD tag: samtools view -bSu $sam_file -T $reference_fasta | samtools sort -n -o - samtools_nsort_tmp | samtools fixmate /dev/stdin /dev/stdout | samtools sort -o - samtools_csort_tmp | samtools fillmd -u - $reference_fasta > $fixed_bam_file B3. Solid Low coverage Solid data (50mer-50mer) were aligned with bfast 0.7.0a: 1. Merge into single fastq file based on SRR ID 2. bfast match –f hs37d5.fa –r SRRxxxxxx.fastq –A 1 –K 8 –M 384 –n 8 –Q 25000 –T /tmp/ -t > SRRxxxxxx.bmf 3. bfast localign –f hs37d5.fa –m SRRxxxxxx.bmf –A 1 –o 20 –n 8 –t > SRRxxxxxx.baf 4. bfast postprocess –f hs37d5.fa –i SRRxxxxxx.baf –A 1 –a 3 –Y 1 –r SRRxxxxxx.ReadGroup.txt –n 8 –Q 1000 –t > bfast.reported.file.SRRxxxxxx.sam 5. Java –Xmx2g –jar /bin/picard/SortSam.jar I=bfast.reported.file.SRRxxxxxx.sam O=bfast.reported.file.SRRxxxxxx.bam SO=coordinate TMP_DIR=/tmp/ VALIDATION_STRINGENCY=SILENT Low coverage Solid data (50mer-35mer) were aligned with bfast+bwa-0.7.0a: 1. Merge into single fastq file based on SRR ID 2. 50mers bfast match –f hs37d5.fa –r SRRxxxxxx*.fastq –A 1 –K 8 –M 384 –n 8 –Q 25000 –T /tmp/ -t > SRRxxxxxx*.bmf 35mers bfast bwaaln -c -t 4 –f SRRxxxxxx_2.bmf hs37d5.fa SRRxxxxxx_2.fastq 3. bfast localalign -f hs37d5.fa -m SRRxxxxxx.bmf –O SRRxxxxxx_1.bmf –T SRRxxxxxx_2.bmf -A 1 –o 20 –n 8 –t > SRRxxxxxx.baf 4. bfast postprocess -f hs37d5.fa -i SRRxxxxxx.baf -A 1 -a 3 -Y 0 -m 10 -M 28 -r SRRxxxxxx.ReadGroup.txt -n 4 -Q 1000 -t > bfast.reported.file.SRRxxxxxx.sam 5. Java –Xmx2g –jar /bin/picard/SortSam.jar I=bfast.reported.file.SRRxxxxxx.sam O=bfast.reported.file.SRRxxxxxx.bam SO=coordinate TMP_DIR=/tmp/ VALIDATION_STRINGENCY=SILENT Exome Solid data were aligned with bfast0.6.4d 1. merge into single fastq file based on SRR ID 2. bfast match -A 1 -t -n 8 -f hs37d5.fa -r $merged_fastq_file> $match_file 3. bfastlocalalign -A 1 -t -n 8 -f hs37d5.fa -m $match_file> $align_file 4. bfastpostprocess -f hs37d5.fa -i $align_file -a 3 -O 1 > $sam_file 5. java -Xmx22000M –jar SamFormatConverter.jar TMP_DIR=/space1/tmp/ INPUT=$sam_file OUTPUT=$bam_file VALIDATION_STRINGENCY=STRICT 6. java -Xmx22000M –jarSortSam.jar TMP_DIR=/space1/tmp/ INPUT=$bam_file OUTPUT=$sorted_bam_file SORT_ORDER=coordinate VALIDATION_STRINGENCY=STRICT 7. java -Xmx22000M –jarMarkDuplicates.jar TMP_DIR=/space1/tmp/ INPUT=$sorted_bam_file OUTPUT=$dups_bam_file METRICS_FILE='./metric_file.picard' VERBOSITY=ERROR VALIDATION_STRINGENCY=STRICT B4. Bam Improvement The run-level alignment BAMs are improved in various ways to help increase the quality and speed of subsequent SNP calling that may be carried out on them. For Illumina BAMs the following improvements were performed: 1. Reads undergo local realignment around known Indels using GATK IndelRealigner. java $jvm_args -jar GenomeAnalysisTK.jar -T RealignerTargetCreator -R $reference_fasta -o $intervals_file -known $known_indels_file(s) java $jvm_args -jar GenomeAnalysisTK.jar -T IndelRealigner -R $reference_fasta -I $bam_file -o $realigned_bam_file -targetIntervals $intervals_file -known $known_indels_file(s) -LOD 0.4 -model KNOWNS_ONLY -compress 0 --disable_bam_indexing where the known Indels are from the following: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_mapping_resources/ALL.wgs.indels_mills_devine_hg19_leftAligned_collapsed_double_hit.indels.sites.vcf.gz ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_mapping_resources/ALL.wgs.low_coverage_vqsr.20101123.indels.sites.vcf.gz 2. Realigned BAMs have their read qualities recalibrated with GATK CountCovariates and TableRecalibration. java $jvm_args -jar GenomeAnalysisTK.jar -T CountCovariates -R $reference_fasta -I $realigned_bam_file -recalFile recal_data.csv -knownSites $known_sites_file(s) -l INFO -L '1;2;3;4;5;6;7;8;9;10;11;12;13;14;15;16;17;18;19;20;21;22;X;Y;MT' -cov ReadGroupCovariate -cov QualityScoreCovariate -cov CycleCovariate -cov DinucCovariate java $jvm_args -jar GenomeAnalysisTK.jar -T TableRecalibration -R $reference_fasta -recalFile recal_data.csv -I $realigned_bam_file -o $recalibrated_bam_file -l INFO -compress 0 --disable_bam_indexing where the known sites for recalibration are from dbSNP135: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/phase2_mapping_resources/ALL.wgs.dbsnp.build135.snps.sites.vcf.gz 3. samtools calmd is applied to the recalibrated BAMs, which fixes the NM tags and introduces BQ tags which can be used during SNP calling. samtools calmd -Erb $recalibrated_bam_file $reference_fasta > $bq_bam_file For Exome Solid BAMs, only step 2 was performed. For low coverage Solid BAMs, only steps 1 and 2 were performed. B5. Release bam file production The improved BAMs are merged together to get the release BAM files available for download. Release BAM files therefore contain reads from multiple readgroups. Production proceeds broadly as follows: 1. Run-level BAMs have extraneous tags (OQ, XM, XG, XO) stripped from them, to reduce total file size by around 30%. 2. Tag-stripped BAMs are merged to the library level with Picard. (merging to a given level means to create a single BAM file containing all the readgroups that share a given member of that level; in this case it means a BAM will be made for each library, each library BAM containing all the reads from the run-level BAMs associated with that library). java $jvm_args -jar MergeSamFiles.jar INPUT=$tag_stripped_bam_file(s) OUTPUT=$library_level_bam VALIDATION_STRINGENCY=SILENT 3. PCR duplicates are marked in library-level BAMs using Picard MarkDuplicates. java $jvm_args -jar MarkDuplicates.jar INPUT=$library_level_bam OUTPUT=$markdup_bam_file ASSUME_SORTED=TRUE METRICS_FILE=/dev/null VALIDATION_STRINGENCY=SILENT 4. Duplicate-marked library-level BAMs are merged to the platform level. java $jvm_args -jar MergeSamFiles.jar INPUT=$markdup_bam_file(s) OUTPUT=$platform_level_bam VALIDATION_STRINGENCY=SILENT 5. Platform-level BAMs are split into chromosomes (and other, see above) BAMs, which are the release files. C. QA of the alignment data by DCC Platform-level bams for individuals are checked by a QA process by DCC before they are released to the ftp site. The QA mainly check for md5 for each file to make sure the file was not corrupted during transfer and the completeness of data: 1. All reads and runs for an individual should be included in corresponding BAM files 2. All reads and runs from a BAM file should be from the same individual D. bas files .bas files are bam statistic files, one line per readgroup, columns separated by tabs. The first line is a header that describes each column. The first 6 columns provide meta information about each readgroup. The remaining columns provide various statistics about the readgroup, calculated by going through the release bams. Where data isn't available to calculate the result for a column the default value will be 0. Each column is described in detail below. Column 1 'bam_filename': The DCC bam file name in which the readgroup data can be found. Column 2 'md5': The md5 checksum of the bam file named in column 1. Column 3 'study': The SRA study id this readgroup belongs to. Column 4 'sample': The sample (individual) identifier the readgroup came from. Column 5 'platform': The sequencing platform (technology) used to sequence the readgroup. Column 6 'library': The name of the library used for the readgroup. Column 7 'readgroup': The readgroup identifier. This is unique per .bas file. The remaining columns summarise data for reads with this RG tag in the bam file given in column 1. Column 8 '#_total_bases': The sum of the length of all reads in this readgroup. Column 9 '#_mapped_bases': The sum of the length of all reads in this readgroup that did not have flag 4 (== unmapped). Column 10 '#_total_reads': The total number of reads in this readgroup. Column 11 '#_mapped_reads': The total number of reads in this readgroup that did not have flag 4 (== unmapped). Column 12 '#_mapped_reads_paired_in_sequencing': As for column 10, but also requiring flag 1 (== reads paired in sequecing). Column 13 '#_mapped_reads_properly_paired': As for column 10, but also requiring flag 2 (== mapped in a proper pair, inferred during alignment). Column 14 '%_of_mismatched_bases': Calculated by summing the read lengths of all reads in this readgroup that have an NM tag, summing the edit distances obtained from the NM tags, and getting the percentage of the latter out of the former to 2 decimal places. Column 15 'average_quality_of_mapped_bases': The mean of all the base qualities of the bases counted for column 8, to 2 decimal places. Column 16 'mean_insert_size': The mean of all insert sizes (ISIZE field) greater than 0 for properly paired reads (as counted in column 12) and with a mapping quality (MAPQ field) greater than 0. Rounded to the nearest whole number. Column 17 'insert_size_sd': The standard deviation from the mean of insert sizes considered for column 15. To 2 decimal places. Column 18 'median_insert_size': The median insert size, using the same set of insert sizes considered for column 15. Column 19 'insert_size_median_absolute_deviation': The median absolute deviation of the column 17 data. Column 20 '#_duplicate_reads': The number of reads which were marked as duplicates Column 21' #_duplicate_bases': The number of bases which were narked as duplicated