Directory Contents

DCC VerifyBamID results for 1401 exome bams based on the 20110421 sequence.index,
plus the VerifyBamID output files for each bam checked.

See http://genome.sph.umich.edu/wiki/VerifyBamID for how to interpret results.

========
files

20110614_1401_exome_bam_20110421_verifybam.results

contains results from the VerifyBamID .selfSM files for all 1401 exome bams,
sorted in the following manner:

cat OUTPUT_FILES/*/*.selfSM | grep -v ^SEQ | cut -f 1,4,5,6,17,19,21,25,26,27,28 | sort -n -k 2 | cat -n

These results are then split based on center and platform.

20110614_bc_illumina_20110421_verifybamid.results   688
20110614_bi_illumina_20110421_verifybam.results     530
20110614_bcm_solid_20110421_verifybam.results       188

=========
Directory OUTPUT_FILES

contains all output files created by VerifyBamID for each bam checked, in
directories labelled by sample id.

========
EXOME_DATA

contains the files required by VerifyBamID when using the --bfile option,
e.g. --bfile exome

========================================================================

Below is an outline of the process used to create the files used by VerifyBamID,
and an example command line showing how the program was run.

The snps used for analysis can be found in
ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20110527_bi_omni_1525_v2_genotypes/

Only snps that passed all filters were used.

zcat Omni25_genotypes_1525_samples_v2.b37.vcf.gz | egrep "^#|PASS" > passed_1525.vcf
bgzip passed_1525.vcf
tabix -p vcf passed_1525.vcf.gz

I sorted the target list so that CHR is in numerical order.

sort -k 1n,1 -k 2n,2 20110426_exome_add50bp.consensus.bed > sorted.target_list

perl -e 'while(<>){chomp;@aa =split /\s+/; print "tabix ../passed_1525.vcf.gz $aa[0]:$aa[1]-$aa[2] >> exome_targetted.vcf\n";}' sorted.target_list > sort_get_exome_targets.sh

head sort_get_exome_targets.sh
tabix passed_1525.vcf.gz 1:69040-70058 >> exome_targetted.vcf
tabix passed_1525.vcf.gz 1:861271-861443 >> exome_targetted.vcf
.....
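
The step that writes one tabix query per target interval could also be sketched as a small shell function built on awk. This is an illustration only, not the command that was actually run: make_tabix_cmds and its three arguments are hypothetical names, and the input is assumed to be a whitespace-delimited file with chromosome, start and end in the first three columns.

```shell
#!/bin/sh
# Sketch: print one tabix region query per target interval.
# make_tabix_cmds is a hypothetical helper, not part of tabix or VerifyBamID.
make_tabix_cmds() {
    vcf=$1      # bgzipped, tabix-indexed VCF to query
    targets=$2  # whitespace-delimited targets: chrom, start, end
    out=$3      # file that the generated commands append to
    # Skip lines with fewer than three fields; print fields as-is.
    awk -v vcf="$vcf" -v out="$out" \
        'NF >= 3 { printf "tabix %s %s:%s-%s >> %s\n", vcf, $1, $2, $3, out }' \
        "$targets"
}
```

Under those assumptions it would be used like this:

make_tabix_cmds passed_1525.vcf.gz sorted.target_list exome_targetted.vcf > sort_get_exome_targets.sh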
./sort_get_exome_targets.sh

Create plink formatted files.

vcftools_0.1.5/cpp/vcftools --plink --vcf exome_targetted.vcf
rename out exome out*

plink/plink-1.07-x86_64/plink --file exome --maf 0.01 --geno 0.05 --make-bed

@----------------------------------------------------------@
|        PLINK!       |     v1.07      |   10/Aug/2009     |
|----------------------------------------------------------|
|  (C) 2009 Shaun Purcell, GNU General Public License, v2  |
|----------------------------------------------------------|
|  For documentation, citation & bug-report instructions:  |
|        http://pngu.mgh.harvard.edu/purcell/plink/        |
@----------------------------------------------------------@

Web-based version check ( --noweb to skip )
Connecting to web...  Problem connecting to web

Writing this text to log file [ plink.log ]
Analysis started: Fri May 20 12:51:56 2011

Options in effect:
	--file exome
	--maf 0.01
	--geno 0.05
	--make-bed

69074 (of 69074) markers to be included from [ exome.map ]
Warning, found 1525 individuals with ambiguous sex codes
Writing list of these individuals to [ plink.nosex ]
1525 individuals read from [ exome.ped ]
0 individuals with nonmissing phenotypes
Assuming a disease phenotype (1=unaff, 2=aff, 0=miss)
Missing phenotype value is also -9
0 cases, 0 controls and 1525 missing
0 males, 0 females, and 1525 of unspecified sex
Before frequency and genotyping pruning, there are 69074 SNPs
1525 founders and 0 non-founders found
Total genotyping rate in remaining individuals is 0.997257
307 SNPs failed missingness test ( GENO > 0.05 )
16060 SNPs failed frequency test ( MAF < 0.01 )
After frequency and genotyping pruning, there are 52757 SNPs
After filtering, 0 cases, 0 controls and 1525 missing
After filtering, 0 males, 0 females, and 1525 of unspecified sex
Writing pedigree information to [ plink.fam ]
Writing map (extended format) information to [ plink.bim ]
Writing genotype bitfile to [ plink.bed ]
Using (default) SNP-major mode

Analysis finished: Fri May 20 12:53:24 2011

rename plink exome plink*

example command line:

verifyBamID --reference human_g1k_v37.fa --bfile exome --verbose -d 1500 --precise --in HG00181.mapped.ILLUMINA.BWA.FIN.exome.20110228.bam --out HG00181

========
Contact:

Richard Smith
DCC 1000 Genomes Project
smithre@ebi.ac.uk or resequencing-informatics@ebi.ac.uk
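
The example verifyBamID command line can be generalised to a list of bams with a short shell function. This is a sketch only: print_verify_cmds is a hypothetical helper, and it assumes the sample id is the part of the bam file name before the first dot (as in HG00181.mapped.ILLUMINA.BWA.FIN.exome.20110228.bam). The options are copied from the example command line; the function only prints the commands, it does not run them.

```shell
#!/bin/sh
# Sketch: print one verifyBamID command per bam path given as an argument.
# print_verify_cmds is a hypothetical helper for illustration.
print_verify_cmds() {
    for bam in "$@"; do
        name=${bam##*/}      # drop any leading directory
        sample=${name%%.*}   # assumed sample id: text before the first dot
        printf 'verifyBamID --reference human_g1k_v37.fa --bfile exome --verbose -d 1500 --precise --in %s --out %s\n' \
            "$bam" "$sample"
    done
}
```

For example, print_verify_cmds *.bam > run_verifybamid.sh would write the commands to a script (run_verifybamid.sh is an illustrative name) so they can be inspected before being executed.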